I-JEPA applies the JEPA idea to images: it predicts the representations of target patches from a context region, without any view-level augmentations. The resulting representations transfer better to semantic tasks than those learned by pixel-reconstruction methods.
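To make the context/target split concrete, here is a minimal sketch of how target blocks and a context block might be sampled on a grid of image patches. The function names, the 14×14 grid, and the block sizes are illustrative assumptions, not values taken from the text above. Target patches are removed from the context so that the prediction is not trivial.

```python
import numpy as np

def sample_block(rng, grid, block_h, block_w):
    """Return flat indices of a random block_h x block_w rectangle of patches."""
    top = rng.integers(0, grid - block_h + 1)
    left = rng.integers(0, grid - block_w + 1)
    rows, cols = np.meshgrid(
        np.arange(top, top + block_h),
        np.arange(left, left + block_w),
        indexing="ij",
    )
    return (rows * grid + cols).ravel()

def sample_context_and_targets(rng, grid=14, n_targets=4,
                               target_size=(4, 4), context_size=(10, 10)):
    """Sample target blocks and a context block that excludes all target patches."""
    # Hypothetical sizes: a few small target blocks and one large context block.
    targets = [sample_block(rng, grid, *target_size) for _ in range(n_targets)]
    context = sample_block(rng, grid, *context_size)
    # Drop any context patch that also belongs to a target block.
    context = np.setdiff1d(context, np.concatenate(targets))
    return context, targets

rng = np.random.default_rng(0)
context, targets = sample_context_and_targets(rng)
```

The context indices feed the context encoder, while each target block defines the patches whose representations the predictor must produce.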
MAE predicts pixels. Contrastive methods match views. JEPA predicts representations of target regions from context regions — in an abstract space where irrelevant details have already been discarded.
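The difference from pixel prediction can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not a reference implementation: the encoders are single linear layers, the predictor pools the context into one vector, and all names (`encode`, `jepa_loss`, `ema_update`) are hypothetical. The exponential-moving-average update of the target encoder is borrowed from published JEPA variants and is not described in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(patches, weights):
    """Toy encoder: one linear map followed by tanh, applied per patch."""
    return np.tanh(patches @ weights)

def jepa_loss(patches, context_idx, target_idx, w_context, w_target, w_predictor):
    """Mean squared error between predicted and actual target representations."""
    context_repr = encode(patches[context_idx], w_context)  # (n_ctx, d)
    target_repr = encode(patches[target_idx], w_target)     # (n_tgt, d)
    # Hypothetical predictor: pool the context, then map it to a single
    # prediction that is compared against every target representation.
    pooled = context_repr.mean(axis=0)                      # (d,)
    predicted = pooled @ w_predictor                        # (d,)
    # The loss lives in representation space; no pixels are reconstructed.
    return float(np.mean((predicted - target_repr) ** 2))

def ema_update(w_target, w_context, momentum=0.99):
    """Move the target-encoder weights slowly toward the context encoder."""
    return momentum * w_target + (1.0 - momentum) * w_context

patch_dim, repr_dim, n_patches = 48, 16, 20
patches = rng.normal(size=(n_patches, patch_dim))
w_context = rng.normal(scale=0.1, size=(patch_dim, repr_dim))
w_target = w_context.copy()
w_predictor = rng.normal(scale=0.1, size=(repr_dim, repr_dim))

context_idx = np.arange(0, 12)   # patches the model sees
target_idx = np.arange(12, 20)   # patches whose representations it predicts
loss = jepa_loss(patches, context_idx, target_idx, w_context, w_target, w_predictor)
```

In a real training loop, gradients of `loss` would update the context encoder and predictor only, and `ema_update` would then refresh the target encoder.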
V-JEPA extends JEPA to video: predict the representations of future or masked frames from context frames. No pixel reconstruction, no contrastive loss — just abstract prediction across time.
JEPA is a learning architecture. World models are the goal it points toward — internal simulators that can predict the consequences of actions and support planning without interacting with the real world.