[Figure 2]

Figure 2: Joint-Embedding Predictive Architectures are trained to predict the representation of an input y from the representation of another input x. The additional variable z provides the predictor with information about the transformation that computes y from x.

Our goal is to explore the effectiveness of feature prediction as a stand-alone objective for learning visual representations from video. To that end, we use a joint-embedding predictive architecture (JEPA); see Figure 2. The main idea behind a JEPA is to learn by predicting the representation of an input y from the representation of another input x. The basic architecture is made up of an encoder, Eθ(·), which computes the representation of the inputs, and a predictor, Pϕ(·), which predicts the representation of y from the representation of x, conditioned on a variable z indicating the transformation (or corruption) between x and y. Conditioning on z enables the generation of distinct predictions for various transformations of x.
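
To make the encoder/predictor split concrete, here is a minimal PyTorch sketch of the JEPA objective described above. The MLP encoder and predictor, the placeholder dimensions, and the stop-gradient on the target branch are illustrative assumptions standing in for the transformer networks and EMA target encoder used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JEPA(nn.Module):
    def __init__(self, input_dim=1024, dim=256, z_dim=16):
        super().__init__()
        # E_theta: encodes both x and y into the representation space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        # P_phi: predicts the representation of y from that of x, conditioned on z
        self.predictor = nn.Sequential(
            nn.Linear(dim + z_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x, y, z):
        s_x = self.encoder(x)        # representation of x
        with torch.no_grad():        # stop-gradient on the target branch
            s_y = self.encoder(y)    # (an EMA target encoder is typically used to avoid collapse)
        # Predict s_y from s_x, conditioned on the transformation variable z
        s_y_hat = self.predictor(torch.cat([s_x, z], dim=-1))
        # Feature-prediction loss: representations are compared, not pixels
        return F.mse_loss(s_y_hat, s_y)


# Toy usage: x and y are flattened inputs, z encodes the transformation between them
model = JEPA()
loss = model(torch.randn(8, 1024), torch.randn(8, 1024), torch.randn(8, 16))
loss.backward()
```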

V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. This mirrors the Image Joint Embedding Predictive Architecture (I-JEPA), which compares abstract representations of images rather than the pixels themselves. Unlike generative approaches that try to fill in every missing pixel, V-JEPA can discard unpredictable information, which improves training and sample efficiency by a factor of 1.5x to 6x.
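
A hedged sketch of this masked feature-prediction objective is shown below. The tensor shapes, helper names, and the L1 regression target are assumptions for illustration, not the released V-JEPA implementation, which operates on spatiotemporal patch tokens with transformer encoders and predictors.

```python
import torch
import torch.nn.functional as F


def masked_feature_loss(context_encoder, target_encoder, predictor,
                        video_tokens, mask):
    """video_tokens: (B, N, D) patchified video clip; mask: (B, N) bool, True = hidden."""
    # Context branch sees only the visible tokens (masked ones are zeroed here
    # for simplicity; in practice they are dropped from the sequence).
    visible = video_tokens * (~mask).unsqueeze(-1)
    s_x = context_encoder(visible)
    # Target representations come from the full clip with gradients stopped
    # (typically an EMA copy of the context encoder, to avoid collapse).
    with torch.no_grad():
        s_y = target_encoder(video_tokens)
    # The predictor infers the representations of the hidden regions from the context.
    s_y_hat = predictor(s_x)
    # Regress predicted features onto target features at the masked locations only;
    # unpredictable pixel-level detail is simply never reconstructed.
    return F.l1_loss(s_y_hat[mask], s_y[mask])


# Toy usage with linear stand-ins for the encoders and predictor:
enc = torch.nn.Linear(64, 32)
loss = masked_feature_loss(enc, enc, torch.nn.Linear(32, 32),
                           torch.randn(2, 16, 64), torch.rand(2, 16) > 0.5)
```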
