Contrastive learning teaches a model that two views of the same image should be close in representation space, and views of different images should be far apart. The details of how you enforce this determine everything.
I-JEPA applies the JEPA idea to images: predict the representations of target patches from a context region, without any view-level augmentations. The result transfers better to semantic tasks than pixel-level methods.
Mask 75% of an image's patches. Train a model to reconstruct them. The result is a rich visual representation — and the recipe works because pixels are redundant and structure is not.