MAE predicts pixels. Contrastive methods match views. JEPA predicts representations of target regions from context regions — in an abstract space where irrelevant details have already been discarded.
V-JEPA extends JEPA to video: predict the representations of future or masked frames from context frames. No pixel reconstruction, no contrastive loss — just abstract prediction across time.
JEPA is a learning architecture. World models are the goal it points toward — internal simulators that can predict the consequences of actions and support planning without interacting with the real world.