JEPA World Models: Innovative Predictive Learning Across Images, Video, and Agents
- Srijon Mandal
- Nov 23
Joint-Embedding Predictive Architectures (JEPAs) are a family of models that learn by predicting high-level features rather than pixels. They unify image-based learning (I-JEPA), video-based learning (V-JEPA), and general predictive world models for autonomous agents.
If you zoom out a bit, modern self‑supervised vision methods mostly fall into two categories:
Invariance-based: given two augmented views of the same image, force the encoder to produce almost identical embeddings, and push different images apart.
Reconstruction-based: hide part of the input and ask a model to reconstruct the missing pixels or tokens.
Joint‑Embedding Predictive Architectures (JEPAs) take a different route:
Don’t force everything to be invariant. Don’t waste capacity on pixel-perfect reconstructions.
Instead, predict one representation from another.
That single shift — predicting features rather than pixels — is enough to give you scalable, semantic models for images, videos, and eventually full‑blown world models for agents.
This post walks through the core idea, then dives into I‑JEPA (images), V‑JEPA (video), and how they plug into a larger “world model + actor” architecture.
1. What is a JEPA?
A JEPA predicts the abstract representation of one signal from the representation of another compatible signal. Unlike pixel-reconstruction or contrastive methods, JEPAs learn semantic, stable, and scalable features.

At heart, a JEPA is very simple. You give it two related signals, `x` and `y` — for example:
two regions from the same image,
two parts of the same video,
or current state and future state of some system.
Then you do three things:
1. Encode each input into a representation:
`Enc_x(x) → s_x`
`Enc_y(y) → s_y`
2. Feed `s_x` through a predictor, optionally with some side information `z` describing what's missing or how `y` relates to `x`.
3. Train the system so that the predictor's output matches `s_y` in feature space.
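In code, that recipe fits in a few lines. Here is a minimal sketch in PyTorch; the module names (`context_encoder`, `target_encoder`, `predictor`) and the choice of regression loss are placeholders for illustration, not any particular released implementation.

```python
import torch
import torch.nn.functional as F

def jepa_loss(context_encoder, target_encoder, predictor, x, y, z=None):
    """One JEPA training step: predict the representation of y from
    the representation of x, entirely in feature space."""
    s_x = context_encoder(x)                      # Enc_x(x) -> s_x
    with torch.no_grad():                         # no gradient into the targets
        s_y = target_encoder(y)                   # Enc_y(y) -> s_y
    s_y_hat = predictor(s_x, z) if z is not None else predictor(s_x)
    # Any feature-space regression loss works; I-JEPA uses L2, V-JEPA uses L1.
    return F.l1_loss(s_y_hat, s_y)
```

In practice the target encoder is typically an exponential moving average of the context encoder rather than an independently trained network, which (together with the stop-gradient) helps prevent representation collapse.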
2. I‑JEPA: Learning from Images
I‑JEPA predicts the representations of several target blocks from a single context block, without relying on handcrafted augmentations. The target blocks are large (so they capture semantic content rather than low‑level texture), and the context encoder only processes the visible context patches, which keeps pretraining efficient.
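To make that concrete, here is a toy version of the block sampling on a grid of image patches. The block sizes and counts below are made up for readability and are not the paper's exact ranges.

```python
import random

def sample_ijepa_blocks(grid=14, num_targets=4):
    """Sample several large target blocks and one big context block on a
    patch grid; the context keeps only patches not covered by any target."""
    def block(min_side, max_side):
        h = random.randint(min_side, max_side)
        w = random.randint(min_side, max_side)
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

    targets = [block(4, 7) for _ in range(num_targets)]   # large, semantic blocks
    context = block(12, grid)                             # most of the image
    context -= set().union(*targets)                      # encoder sees only visible patches
    return context, targets
```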

3. V‑JEPA: Learning from Video
V‑JEPA extends feature prediction to video. It tokenizes spatiotemporal volumes, masks large 3D blocks, and predicts their feature representations. This enables strong motion and appearance understanding with a single backbone.
Architecture in a nutshell:
Take a video of `T` frames at resolution `H×W`.
Cut it into 3D patches (tokens) — e.g. 16×16 pixels across 2 frames each.
Use a ViT‑style backbone over this sequence of tokens.
x‑encoder (context)
Drop a large fraction of tokens according to a spatio‑temporal mask (around 90% masked on average).
Feed the remaining tokens into the context encoder and get features for each visible token.
y‑encoder (targets)
Run the full video (all tokens) through an EMA copy of the encoder.
At the output, keep only the embeddings for the masked tokens — those are the targets.
Predictor
Concatenate context features with a set of learnable mask tokens whose positional embeddings encode the location of each missing patch.
Run them through a narrow transformer to predict a feature for every mask token.
Loss
Apply an L1 loss between predicted features and the y‑encoder features, with gradients stopped on the targets (BYOL‑style).
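Putting those four pieces together, a single V‑JEPA-style training step could look roughly like the sketch below. Shapes, module names, and the handling of positional embeddings are simplifying assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_enc, context_enc, momentum=0.998):
    # BYOL-style exponential moving average update of the y-encoder.
    for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

def vjepa_step(context_enc, target_enc, predictor, mask_queries, tokens, mask):
    """tokens: (B, N, D) spatio-temporal patch embeddings.
    mask: (B, N) bool, True where a token is hidden.
    mask_queries: (1, M, D) learnable mask tokens, assumed to already carry
    the positional information of the missing patches.
    Assumes every sample hides the same number of tokens (M)."""
    B, N, D = tokens.shape
    # x-encoder: only the visible tokens are processed (the efficiency win).
    ctx = context_enc(tokens[~mask].view(B, -1, D))
    # y-encoder: the EMA copy sees the full clip; keep only masked positions.
    with torch.no_grad():
        tgt = target_enc(tokens)[mask].view(B, -1, D)
    # Predictor: context features plus one query per missing patch.
    pred = predictor(torch.cat([ctx, mask_queries.expand(B, -1, -1)], dim=1))
    pred = pred[:, ctx.shape[1]:]                # keep predictions at masked slots
    return F.l1_loss(pred, tgt)                  # gradients already stopped on tgt
```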
Efficient Masking
Videos are highly redundant in both space and time. V‑JEPA uses a clever “multi‑block” mask:
Short‑range: union of several small blocks covering ~15% of each frame.
Long‑range: union of a couple of large blocks covering ~70% of each frame.
Together they mask around 90% of the tokens.
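Here is a toy version of that multi-block mask on a (frames × height × width) token grid; the block sizes are rough stand-ins for the ~15% and ~70% coverage described above.

```python
import numpy as np

def multiblock_mask(t=8, h=14, w=14, n_short=8, n_long=2, seed=None):
    """Union of small (short-range) and large (long-range) spatial blocks,
    repeated across every frame; together they hide most of the tokens."""
    rng = np.random.default_rng(seed)
    mask2d = np.zeros((h, w), dtype=bool)

    def add_blocks(n, frac):
        for _ in range(n):
            bh, bw = max(1, int(h * frac)), max(1, int(w * frac))
            r = rng.integers(0, h - bh + 1)
            c = rng.integers(0, w - bw + 1)
            mask2d[r:r + bh, c:c + bw] = True

    add_blocks(n_short, 0.25)   # several small blocks (short-range coverage)
    add_blocks(n_long, 0.85)    # a couple of large blocks (long-range coverage)
    return np.broadcast_to(mask2d, (t, h, w))   # same spatial mask on every frame
```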

4. Predictive World Models for Agents
Extending JEPA principles to embodied agents leads to hierarchical predictive world models. These combine perception, a predictive world model (with short‑term memory), a cost module, and an actor to imagine future states and plan actions.
Yann LeCun’s “A Path Towards Autonomous Machine Intelligence” argues that a capable agent needs at least four big pieces:
1. Perception – turns raw sensory input into a state representation.
2. World model – predicts how the world (in that representation space) evolves, and fills in missing details.
3. Cost / critic – evaluates how good or bad a situation is, based on hard‑wired drives and learned value estimates.
4. Actor – proposes and refines action sequences to minimize long‑term cost.
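Schematically, those four modules and the “imagination” loop between them could be wired up like this; the interfaces are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    perception: Callable    # raw observation -> state representation s
    world_model: Callable   # (s, action) -> predicted next state, in feature space
    cost: Callable          # s -> scalar: how good or bad this situation is
    actor: Callable         # s -> candidate action sequence

    def imagine(self, obs, actions):
        """Roll a candidate action sequence forward entirely in latent
        space and accumulate the predicted cost (no pixels are generated)."""
        s = self.perception(obs)
        total = 0.0
        for a in actions:
            s = self.world_model(s, a)
            total += self.cost(s)
        return total
```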

The proposal is to build the world model largely out of hierarchical JEPAs (H‑JEPA):
At low levels, they operate over short horizons and fine spatial details.
Higher levels work with abstracted states (objects, goals, plans) and longer horizons.
All levels use essentially the same objective: predict the representation of future or missing states from current ones, in feature space, possibly conditioned on action sequences.
Planning in this setup looks like model‑predictive control:
The actor proposes a sequence of high‑level actions.
The world model (an H‑JEPA stack) rolls forward multiple possible future trajectories in latent space.
The cost module scores those trajectories.
Gradients (or other search methods) are used to nudge the action sequence toward lower predicted cost.
Only the first action (or the first few) is executed; then the process repeats.
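In code, that loop is essentially gradient descent on an action sequence through a differentiable world model and cost. Everything named below is a stand-in; the point is the shape of the computation, not a specific implementation.

```python
import torch

def plan(world_model, cost_fn, s0, horizon=12, iters=50, lr=0.1, act_dim=4):
    """Latent-space model-predictive control: optimize an action sequence so
    that the imagined trajectory has low predicted cost, then act."""
    actions = torch.zeros(horizon, act_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        s, total_cost = s0, 0.0
        for a in actions:                  # roll the world model forward
            s = world_model(s, a)          # predicted next state, in feature space
            total_cost = total_cost + cost_fn(s)
        opt.zero_grad()
        total_cost.backward()              # nudge the actions toward lower cost
        opt.step()
    return actions.detach()[0]             # execute only the first action, then replan
```

Gradient descent is just the simplest instantiation; as noted above, other search methods over action sequences fit the same loop.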
Once you have that loop, JEPA is no longer “just” a pretraining trick; it becomes the engine that lets an agent imagine, evaluate, and choose between futures, without ever reconstructing raw pixels.

5. Why Predict Features Instead of Pixels?
Predicting in representation space avoids wasting compute on pixel-level noise and instead focuses on predictable, meaningful structure. This leads to higher semantic quality, better transferability, and more scalable training.
The empirical evidence backs this up: on Kinetics‑400, V-JEPA feature prediction beats pixel reconstruction by about 5 points (roughly 73.7% vs 68.6%).
6. Conclusion
JEPAs present a new direction for self-supervised learning—one that avoids pixel reconstruction, minimizes handcrafted augmentations, and unifies perception and prediction across modalities. They provide a foundation for scalable vision models and future autonomous systems.