JEPA

A first step toward a broadly capable joint-embedding predictive architecture

The idea behind I-JEPA (Image-based Joint-Embedding Predictive Architecture) is to predict missing information in an abstract representation, one that is closer to the general understanding people have than to raw pixels. Compared to generative methods that predict in pixel or token space, I-JEPA uses abstract prediction targets from which unnecessary pixel-level details can be eliminated, which leads the model to learn more semantic features. Another core design choice that guides I-JEPA toward semantic representations is the proposed multi-block masking strategy. Specifically, we demonstrate the importance of predicting large blocks that contain semantic information (that is, blocks of sufficiently large scale), using an informative (spatially distributed) context.
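The multi-block masking idea can be illustrated with a minimal sketch. It is not the reference implementation: the function names, the 14x14 patch grid, the number of target blocks, and the scale and aspect-ratio ranges are all assumptions chosen for illustration (loosely following commonly reported I-JEPA settings). The sketch samples several large target blocks, samples one large context block, and removes any patches the context shares with a target so that the prediction task is not trivial.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample a rectangular block of patch indices on a grid x grid layout."""
    scale = rng.uniform(*scale_range)      # fraction of all patches to cover
    aspect = rng.uniform(*aspect_range)    # height / width of the block
    area = scale * grid * grid
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)         # randint is inclusive on both ends
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}


def multi_block_mask(grid=14, num_targets=4, seed=0):
    """Return (context, targets) as sets of flat patch indices.

    All default values are illustrative assumptions, not taken from the text.
    """
    rng = random.Random(seed)
    # Large target blocks: each covers roughly 15-20% of the image.
    targets = [sample_block(grid, (0.15, 0.20), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    # One large, square-ish context block covering most of the image.
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Remove target patches from the context so targets must be predicted.
    for t in targets:
        context -= t
    return context, targets


context, targets = multi_block_mask()
print(len(context), [len(t) for t in targets])
```

Because the context is what remains of a large block after the targets are carved out, it stays spatially distributed across the image while each target remains large enough to contain semantic content.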

Greater efficiency and strong performance

I-JEPA pretraining is also computationally efficient. It doesn’t involve any overhead associated with applying more computationally intensive data augmentations to produce multiple views. Only one view of the image needs to be processed by the target encoder, and only the context blocks need to be processed by the context encoder.
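A rough token count makes the efficiency argument concrete. The numbers below are assumptions for illustration only (a 14x14 patch grid, 60 visible context patches, and a hypothetical baseline that fully encodes two augmented views); the count also ignores the predictor and any differences in encoder size, so it is an accounting sketch rather than a measurement.

```python
def ijepa_style_tokens(num_patches, num_context_patches):
    """Patch tokens seen by the two encoders for one image."""
    target_encoder = num_patches            # one full view of the image
    context_encoder = num_context_patches   # only the visible context patches
    return target_encoder + context_encoder


def multi_view_tokens(num_patches, num_views=2):
    """Hypothetical baseline: every augmented view is encoded in full."""
    return num_views * num_patches


num_patches = 14 * 14   # assumed patch grid
num_context = 60        # assumed context size after removing target blocks

print(ijepa_style_tokens(num_patches, num_context))  # 256
print(multi_view_tokens(num_patches))                # 392
```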

Empirically, we find that I-JEPA learns strong off-the-shelf semantic representations without the use of hand-crafted view augmentations (see the figure below). It also outperforms pixel-reconstruction and token-reconstruction methods on ImageNet-1K linear probing and semi-supervised evaluation.

A step closer to human-level intelligence in AI

I-JEPA demonstrates the potential of architectures for learning competitive off-the-shelf image representations without the need for extra knowledge encoded through hand-crafted image transformations. It would be particularly interesting to advance JEPAs to learn more general world-models from richer modalities, e.g., enabling one to make long-range spatial and temporal predictions about future events in a video from a short context, and conditioning these predictions on audio or textual prompts.

We look forward to working to extend the JEPA approach to other domains, like image-text paired data and video data. In the future, JEPA models could have exciting applications for tasks like video understanding. This is an important step towards applying and scaling self-supervised methods for learning a general model of the world.

The JEPA model opens up new possibilities for ISAC (integrated sensing and communication) systems. We aim to fuse data semantics rather than simply perform multimodal data fusion. Our goal is to generate missing data locally from the available modalities, or to retrieve the missing information from neighboring nodes. JEPA shows that accurate prediction is feasible when deep models are trained on latent embeddings in a self-supervised manner, rather than on well-labeled raw data.

Intuitive Sensing