Contrastive learning teaches a model that two views of the same image should be close in representation space, and views of different images should be far apart. How you enforce this (the loss, the choice of negatives, the augmentations) largely determines what the model learns.
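One common way to enforce "close for the same image, far for different images" is an InfoNCE-style loss. The following is a minimal NumPy sketch under stated assumptions, not the method of any specific paper: the function name `info_nce_loss`, the temperature of 0.1, and the random embeddings are all illustrative choices.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: row i of z_a and row i of z_b are two views of image i."""
    # L2-normalise so dot products become cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    # logits[i, j] compares view a of image i with view b of image j.
    logits = z_a @ z_b.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal; everything else is a negative.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))  # hypothetical embeddings for 8 images

# Two near-identical views of each image: positives line up, so the loss is low.
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
# Unrelated embeddings: no view matches its partner, so the loss is high.
shuffled = info_nce_loss(z, rng.normal(size=z.shape))
print(aligned < shuffled)
```

The other views in the batch act as negatives here, which is why batch size and temperature matter so much in practice.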
I-JEPA applies the JEPA idea to images: predict the representations of target patches from a context region, without relying on hand-crafted view augmentations. The resulting representations transfer better to semantic tasks than those learned by pixel-reconstruction methods.
MAE predicts pixels. Contrastive methods match views. JEPA predicts representations of target regions from context regions — in an abstract space where irrelevant details have already been discarded.
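To make the contrast concrete, here is a toy forward pass where the loss is computed between representations rather than pixels. It is only a sketch: the linear "encoders", the mean-pooled context, the positional embeddings, the squared-error loss, and the exponential-moving-average target encoder are all assumptions made for illustration, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 16, 32, 8

# Hypothetical linear stand-ins for the real networks.
W_context = rng.normal(scale=0.1, size=(patch_dim, embed_dim))  # context encoder
W_target = W_context.copy()                                     # target encoder (assumed EMA copy)
W_pred = rng.normal(scale=0.1, size=(embed_dim, embed_dim))     # predictor
pos_embed = rng.normal(scale=0.1, size=(num_patches, embed_dim))

def jepa_loss(patches, context_idx, target_idx, w_context, w_target, w_pred):
    # Encode only the context patches and pool them into one summary vector.
    context_repr = (patches[context_idx] @ w_context).mean(axis=0)
    # Predict each target representation from the context plus the target's position.
    predicted = (context_repr + pos_embed[target_idx]) @ w_pred
    # Targets are representations, not pixels (held fixed in real training).
    target_repr = patches[target_idx] @ w_target
    return np.mean((predicted - target_repr) ** 2)

def ema_update(w_target, w_context, momentum=0.99):
    # The target encoder slowly tracks the context encoder.
    return momentum * w_target + (1.0 - momentum) * w_context

patches = rng.normal(size=(num_patches, patch_dim))
context_idx = np.arange(0, 10)   # context region
target_idx = np.arange(10, 16)   # disjoint target region
loss = jepa_loss(patches, context_idx, target_idx, W_context, W_target, W_pred)
W_target = ema_update(W_target, W_context)
print(loss)
```

Note that nothing in the loss touches raw pixel values of the target region; the model is only ever asked to match what the target encoder kept.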
Mask 75% of an image's patches. Train a model to reconstruct them. The result is a rich visual representation — and the recipe works because pixels are redundant and structure is not.
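The masking step itself is short enough to sketch. The example below assumes a 224×224 RGB image cut into 16×16 patches and uniform random masking; the helper names `patchify` and `random_patch_mask` are hypothetical, and the encoder and decoder are left out entirely.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return sorted indices of the visible patches and of the masked patches."""
    rng = np.random.default_rng() if rng is None else rng
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real image
patches = patchify(image, 16)              # 14 x 14 = 196 patches
keep_idx, mask_idx = random_patch_mask(len(patches), 0.75, rng)

visible = patches[keep_idx]   # what the encoder would see
targets = patches[mask_idx]   # what the decoder would be asked to reconstruct
print(visible.shape, targets.shape)
```

In a setup like this, the reconstruction loss would be computed only on `targets`, so the model cannot score well by copying the patches it was shown.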
V-JEPA extends JEPA to video: predict the representations of future or masked frames from context frames. No pixel reconstruction, no contrastive loss — just abstract prediction across time.
Supervised learning requires labels. Labels require humans. At scale, that's the bottleneck. Self-supervised learning sidesteps it by constructing supervision from the data itself.
JEPA is a learning architecture. World models are the goal it points toward — internal simulators that can predict the consequences of actions and support planning without interacting with the real world.