The most important assumption in this work is a distributed multimodal sensing network (DMSN), in which multimodal sensors must communicate with one another to carry out sensing tasks within a common area. We may get lost once we forget or ignore this setting (precondition). JEPA-enabled cross-modal semantic sharing is meaningful if and only if it takes place within a distributed multimodal sensing network, so the storyline must start from that setting. In contrast, typical semantic communication or ISAC systems may have little or even no need for cross-modal information sharing (especially semantic communication, which would otherwise no longer be a ‘communication’ process); such systems can serve as methodologies but should never be our main focus (at least while we are making up our minds).
Another reason we got confused is that we mixed up two concepts: semantic communication and JEPA-enabled cross-modal semantic transition or sharing (e.g., RF → vision), even though they are originally independent processes. What is more, we happened to realize both concepts using only RF signals (which is in fact ISAC), which further aggravated the confusion. In our work, the two become connected with the help of JEPA, because JEPA-enabled cross-modal semantic sharing can greatly reduce the amount of transmitted data.
In applications such as autonomous driving, smart city services, and environmental monitoring, it is essential to gather sensory information from a Distributed Multimodal Sensing Network (DMSN) comprising heterogeneous sensors such as LiDAR, cameras, and audio sensors. In these systems, sensory data must be transmitted to the cloud or central servers for downstream tasks such as prediction, scene reconstruction, or decision-making. In addition, distributed nodes often have to communicate with each other to collaborate and synergize. As a result, a typical DMSN requires sensing and data transmission simultaneously, making it a natural multimodal Integrated Sensing and Communication (ISAC) framework.
However, the immense communication traffic and computational load involved in such networks pose significant challenges, and mitigating them is a critical concern. Deep Learning (DL)-based Semantic Communication (SC) has been introduced as a solution that transforms raw data into compressed latent representations (embeddings) for transmission, easing both traffic and computational load. Yet SC faces several limitations in large-scale DMSN environments:
To address these challenges, we propose JEPA-ISAC, a JEPA-aided multimodal ISAC framework designed to enhance the efficiency and intelligence of DMSNs. Our framework leverages JEPA (Joint Embedding Predictive Architecture) to reduce communication traffic and improve sensing accuracy by facilitating cross-modal semantic sharing and transition.
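To make the sharing step concrete, the following is a minimal PyTorch sketch of the intended data path: the sensing node transmits only a compact RF embedding, and the receiver uses a JEPA-style predictor to map it into the vision semantic space, so downstream tasks can consume a vision-like embedding without any camera frame being sent. The module names, embedding sizes, and RF input shape are hypothetical placeholders, not the actual JEPA-ISAC implementation.

```python
import torch
import torch.nn as nn

# Hypothetical embedding sizes; actual values depend on the chosen backbones.
RF_DIM, VISION_DIM = 256, 768

class RFEncoder(nn.Module):
    """Maps a raw RF snapshot (e.g., a radar/CSI frame) to a compact semantic embedding."""
    def __init__(self, in_dim=2 * 64 * 128, dim=RF_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, dim))

    def forward(self, x):
        return self.net(x)

class CrossModalPredictor(nn.Module):
    """JEPA-style predictor: estimates the vision-space embedding from the RF embedding."""
    def __init__(self, in_dim=RF_DIM, out_dim=VISION_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, out_dim))

    def forward(self, z_rf):
        return self.net(z_rf)

# --- Sensing node: encode locally, transmit only the embedding ---
rf_frame = torch.randn(1, 2, 64, 128)        # raw RF measurement (I/Q x range x Doppler), placeholder shape
z_rf = RFEncoder()(rf_frame)                 # 256 floats are transmitted instead of the raw tensor

# --- Receiver / fusion node: recover vision-space semantics without any camera frame ---
z_vision_hat = CrossModalPredictor()(z_rf)   # fed directly to downstream task heads

raw_bytes, sent_bytes = rf_frame.numel() * 4, z_rf.numel() * 4
print(f"transmitted {sent_bytes} B instead of {raw_bytes} B of raw data")
```

In this sketch the traffic reduction comes purely from transmitting embeddings rather than raw frames; how the predictor is trained is the subject of the JEPA description below.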
JEPA is a self-supervised learning architecture designed to predict missing information in the latent (semantic) space rather than at the pixel or token level. Unlike generative methods that reconstruct data in full detail, JEPA learns semantic representations: high-level abstractions that capture the core contextual meaning of the data while discarding unnecessary details. This makes JEPA well suited to multimodal fusion and cross-modal data reconstruction, as it allows the system to operate on compressed semantic embeddings rather than raw data.
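As a deliberately simplified illustration of this latent-space prediction, the sketch below trains a predictor to match the embedding of masked content instead of reconstructing the content itself; the module sizes, masking scheme, and momentum value are hypothetical choices used only to show where the loss is computed.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256
encoder = nn.Sequential(nn.Linear(1024, 512), nn.GELU(), nn.Linear(512, DIM))   # context encoder (trained)
target_encoder = copy.deepcopy(encoder)                                          # EMA copy, never back-propagated
predictor = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

x = torch.randn(8, 1024)                  # a batch of raw sensor snapshots (placeholder)
context = x.clone()
context[:, 512:] = 0.0                    # mask part of the signal; it is never reconstructed

z_context = encoder(context)              # embed only the visible context
with torch.no_grad():
    z_target = target_encoder(x)          # semantic target: embedding of the full input

loss = F.smooth_l1_loss(predictor(z_context), z_target)   # loss lives in the latent space,
loss.backward()                                            # not at the pixel/sample level
opt.step()
opt.zero_grad()

# Momentum (EMA) update of the target encoder; 0.996 is a typical value, not prescribed here.
with torch.no_grad():
    for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
        p_t.mul_(0.996).add_(p, alpha=0.004)
```

The same recipe applies when the context and target come from different modalities (e.g., an RF context encoder predicting the embedding produced by a vision target encoder), which is the setting exploited in JEPA-ISAC. By incorporating JEPA into the ISAC framework, we address several key issues inherent to traditional SC in multimodal sensing: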