
Figure 2: Joint-Embedding Predictive Architectures are trained to predict the representation of an input y from the representation of another input x. The additional variable z provides the predictor with information about the transformation that computes y from x.

Our goal is to explore the effectiveness of feature prediction as a stand-alone objective for learning visual representations from video. To that end, we use a joint-embedding predictive architecture (JEPA); see Figure 2. The main idea behind a JEPA is to learn by predicting the representation of an input y from the representation of another input x. The basic architecture is made up of an encoder, Eθ(·), which computes the representation of the inputs, and a predictor, Pϕ(·), which predicts the representation of y from the representation of x, conditioned on a variable z indicating the transformation (or corruption) between x and y. Conditioning on z enables the generation of distinct predictions for various transformations of x.
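A minimal sketch of this structure is shown below, written in PyTorch for illustration only: the MLP modules, layer sizes, and dimensions are placeholders standing in for the transformer networks used in practice, but the data flow matches the description above, with the predictor consuming the representation of x together with the conditioning variable z.

```python
import torch
import torch.nn as nn

class JEPA(nn.Module):
    """Toy JEPA: placeholder MLPs stand in for the encoder E_theta and predictor P_phi."""

    def __init__(self, input_dim=1024, repr_dim=256, z_dim=32):
        super().__init__()
        # E_theta(.): computes the representation of an input
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.GELU(), nn.Linear(512, repr_dim)
        )
        # P_phi(.): predicts the representation of y from the representation
        # of x, conditioned on the transformation variable z
        self.predictor = nn.Sequential(
            nn.Linear(repr_dim + z_dim, 512), nn.GELU(), nn.Linear(512, repr_dim)
        )

    def forward(self, x, z):
        s_x = self.encoder(x)                                  # representation of x
        s_y_hat = self.predictor(torch.cat([s_x, z], dim=-1))  # predicted representation of y
        return s_y_hat
```

The training target for the predicted representation is the encoder's own representation of y, so the comparison happens in representation space rather than pixel space, as described next.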
V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. This is similar to how the Image Joint-Embedding Predictive Architecture (I-JEPA) compares abstract representations of images rather than comparing the pixels themselves. Unlike generative approaches that try to fill in every missing pixel, V-JEPA can discard unpredictable information, which improves training and sample efficiency by a factor of 1.5x to 6x.
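To make the "representations rather than pixels" point concrete, the sketch below spells out one way such an objective can be written; the function names, the momentum value, and the two-argument predictor call are assumptions made for illustration. The target features come from a slowly updated target encoder whose gradients are blocked (a common way to avoid representational collapse), and the loss compares feature vectors rather than reconstructing pixels.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(encoder, target_encoder, momentum=0.998):
    """Keep the target encoder as an exponential moving average of the trained one."""
    for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
        p_t.mul_(momentum).add_(p, alpha=1.0 - momentum)

def feature_prediction_loss(encoder, target_encoder, predictor, x, y, z):
    s_x = encoder(x)                 # context representation (receives gradients)
    with torch.no_grad():
        s_y = target_encoder(y)      # target representation (stop-gradient)
    s_y_hat = predictor(s_x, z)      # predicted representation of y
    # Regression in representation space: a generative model would instead
    # compute its loss directly against the raw pixels of y.
    return F.l1_loss(s_y_hat, s_y)
```

Because the loss never touches pixels, the encoder is free to drop low-level detail that cannot be predicted, which is where the efficiency gains mentioned above come from.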
Why V-JEPA for Intelligent Distributed Sensing?
What is the (Simplified) Distributed V-JEPA we Propose?
Justification and Rationale Behind Simplifying V-JEPA
The motivation for simplifying V-JEPA is threefold: reducing computational complexity, enabling cross-device knowledge sharing, and adapting the model to real-world distributed systems, all without sacrificing its predictive capability.
With V-JEPA, we mask out a large portion of a video so that the model sees only a small part of the context. We then ask the predictor to fill in what is missing, not in terms of actual pixels, but as a more abstract description in representation space.
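Under simplifying assumptions, a single masked-prediction step can be sketched as below: the clip is treated as a set of spatio-temporal patch tokens, most of them are hidden, and the predictor outputs one feature vector per hidden position. The context_encoder and predictor modules, the tensor shapes, and the 90% mask ratio are placeholders; V-JEPA itself masks spatial blocks repeated across the full clip rather than fully random tokens.

```python
import torch

def random_token_mask(num_tokens: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Boolean mask over patch tokens; True means hidden from the context."""
    num_masked = int(num_tokens * mask_ratio)
    perm = torch.randperm(num_tokens)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

def predict_masked_features(tokens, context_encoder, predictor, mask):
    # tokens: [batch, num_tokens, dim] patch embeddings of the full clip
    visible = tokens[:, ~mask]             # the small visible portion of the video
    s_context = context_encoder(visible)   # encode only what the model is shown
    # The predictor fills in the blanks in representation space: one feature
    # vector per masked token, never raw pixels.
    return predictor(s_context, mask)
```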



