Preface


The most important assumption in this work is a distributed multimodal sensing network (DMSN), in which multimodal sensors must communicate with one another to carry out sensing tasks over a common area. We risk losing the thread whenever we forget or ignore this setting (or precondition). JEPA-enabled cross-modality semantic sharing is meaningful only within a distributed multimodal sensing network, so the storyline must start from it. In contrast, typical semantic communication (SC) or ISAC systems may have little or no need for cross-modal information sharing (especially SC, which would otherwise cease to be a ‘communication’ process at all); such systems can supply methodologies, but they should never be our main focus (at least while we are making up our minds).

Another source of confusion is that we conflated two concepts: semantic communication and JEPA-enabled cross-modal semantic transition or sharing (e.g., RF → vision), even though they are originally independent processes. What is more, we happened to realize both concepts using only RF signals (which is in fact ISAC), which further aggravated the confusion. In this work, the two become connected through JEPA, because JEPA-enabled cross-modal semantic sharing can drastically reduce the amount of data that must be transmitted.

Introduction


In applications such as autonomous driving, smart city services, and environmental monitoring, it is essential to gather sensory information from a Distributed Multimodal Sensing Network (DMSN) comprising heterogeneous sensors such as LiDAR, cameras, and audio sensors. In these systems, sensory data must be transmitted to the cloud or central servers for downstream tasks such as prediction, scene reconstruction, or decision-making. In addition, distributed nodes must often communicate with one another to collaborate. As a result, a typical DMSN performs sensing and data transmission simultaneously, making it a natural multimodal Integrated Sensing and Communication (ISAC) framework.
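To make this setting concrete, the sketch below shows one possible shape of a per-node uplink report and a deliberately trivial fusion step at the server. The field names, sizes, and fusion rule are illustrative assumptions for this preface only, not part of the proposed framework.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NodeReport:
    """One uplink message from a DMSN node to the central server.
    All field names and sizes are illustrative assumptions, not a
    proposed wire format."""
    node_id: int
    modality: str          # e.g., "lidar", "camera", "audio", "rf"
    timestamp: float       # acquisition time in seconds
    embedding: np.ndarray  # compressed latent representation of one frame

def fuse_reports(reports: list[NodeReport]) -> np.ndarray:
    """Naive server-side fusion: average the received embeddings.
    Stands in for any downstream prediction or reconstruction model."""
    return np.mean([r.embedding for r in reports], axis=0)

# Toy usage: three nodes of different modalities reporting 256-d embeddings.
reports = [NodeReport(i, m, 0.0, np.random.randn(256))
           for i, m in enumerate(["lidar", "camera", "rf"])]
fused = fuse_reports(reports)   # shape: (256,)
```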

However, the immense communication traffic and computation involved in such networks pose significant challenges, and mitigating them is a critical concern. Deep Learning (DL)-based Semantic Communication (SC) has been introduced as a solution that transforms raw data into compressed latent representations (embeddings) for transmission, easing both traffic and computational load. Yet SC faces several limitations in large-scale DMSN environments:

  1. Redundant Distributed Nodes: In large DMSN deployments (e.g., city scale), even when embeddings are used, the volume of sensory data from many nodes can still produce overwhelming traffic. Having every node in the target area transmit its embeddings to the central server or cloud keeps the communication load considerable (a back-of-envelope sketch of this scaling follows this list). Node selection strategies or protocols can filter out redundant nodes, but designing a reasonable protocol is technically demanding and expensive, not to mention adapting it in real time to dynamic environmental changes and varied network architectures.
  2. Lack of Semantic Understanding: Most current SC frameworks do not fully leverage the semantic relationships among multimodal data. Although embeddings are transmitted, they remain isolated from one another in terms of contextual understanding. This limits the system's ability to exploit cross-modal relationships and share semantic information across nodes, reducing its effectiveness in tasks such as collaborative decision-making and cross-node (or cross-modality) data reconstruction.
  3. Decoding Dependency: For tasks such as image or video reconstruction, SC requires source and channel encoding/decoding to restore the raw data from embeddings. Without strong background knowledge to ensure semantic consistency, model training becomes difficult. As the number of modalities and sensor types grows, preparing such background knowledge becomes increasingly costly and time-consuming, leading to unstable performance and low robustness. In addition, the reconstructed raw data must still be processed (e.g., filtering, denoising) before use, further complicating the implementation.
  4. Untapped Potential of Self-enabled RF Sensing in DMSN: Any RF-signal-based communication system inherently carries sensing capability, but this aspect is often overlooked in distributed sensing networks. Incorporating ISAC into semantic communication can improve overall sensing accuracy by reusing the RF signals themselves for sensing tasks.
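To give a feel for the scale behind the first limitation, the rough calculation below estimates aggregate uplink traffic when every node streams embeddings to the server. The node count, embedding size, and frame rate are assumed values chosen purely for illustration, not measurements from any deployment.

```python
# Back-of-envelope aggregate uplink traffic for a large DMSN.
# All numbers below are illustrative assumptions, not measurements.

num_nodes = 10_000        # assumed city-scale deployment
embed_dim = 512           # assumed latent dimension per frame
bytes_per_val = 4         # float32 embedding values
frames_per_sec = 10       # assumed per-node sensing rate

per_node_bps = embed_dim * bytes_per_val * frames_per_sec * 8   # bits/s
aggregate_bps = per_node_bps * num_nodes

print(f"per node : {per_node_bps / 1e3:.1f} kbit/s")   # ~163.8 kbit/s
print(f"aggregate: {aggregate_bps / 1e9:.2f} Gbit/s")  # ~1.64 Gbit/s

# Even with compressed embeddings, the aggregate uplink grows linearly
# with the number of nodes, which is what motivates node selection or
# cross-modal semantic sharing instead of blanket transmission.
```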

To address these challenges, we propose JEPA-ISAC, a JEPA-aided Multimodal ISAC framework designed to enhance the efficiency and intelligence of DMSNs. Our framework leverages JEPA (Joint Embedding Predictive Architecture) to reduce communication traffic and improve sensing accuracy by enabling cross-modal semantic sharing and transition.
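At its core, JEPA trains a predictor to regress the embeddings produced by a target encoder from a context view, taking the loss in latent space instead of reconstructing raw signals. The sketch below shows one training step of this idea in generic form; the layer sizes, EMA rate, and toy "views" are illustrative assumptions and do not reflect the actual configuration of JEPA-ISAC.

```python
import torch
import torch.nn as nn

# Minimal JEPA-style training step: predict target-encoder embeddings of an
# unseen/masked view from the context view, with the loss taken in latent
# space rather than on raw pixels or waveforms. Sizes are assumptions.

class Encoder(nn.Module):
    def __init__(self, in_dim=1024, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))
    def forward(self, x):
        return self.net(x)

context_enc = Encoder()                       # sees the observed part of the signal
target_enc = Encoder()                        # EMA copy, provides regression targets
target_enc.load_state_dict(context_enc.state_dict())
for p in target_enc.parameters():
    p.requires_grad_(False)

predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
opt = torch.optim.AdamW(list(context_enc.parameters()) +
                        list(predictor.parameters()), lr=1e-4)

def jepa_step(context_view, target_view, ema=0.996):
    with torch.no_grad():
        target_emb = target_enc(target_view)          # semantic target, no gradients
    pred_emb = predictor(context_enc(context_view))   # predict it from the context
    loss = nn.functional.mse_loss(pred_emb, target_emb)  # loss in latent space
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                             # EMA update of the target encoder
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()

# Toy usage: two random "views" standing in for, e.g., an RF context and a
# masked visual target in a cross-modal setting.
loss = jepa_step(torch.randn(8, 1024), torch.randn(8, 1024))
```

The key design choice illustrated here is that the regression target lives in embedding space, so the model never needs to reconstruct pixels or waveforms; gradients flow only through the context encoder and predictor, while the target encoder follows by exponential moving average.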

JEPA is a self-supervised learning architecture designed to predict missing information in the latent (semantic) space rather than at the pixel or token level. Unlike generative methods that reconstruct data at a detailed level, JEPA learns semantic representations—high-level abstractions of the data that focus on capturing core contextual meaning while ignoring unnecessary details. This makes JEPA ideal for multimodal fusion and cross-modal data reconstruction, as it allows the system to operate with compressed semantic embeddings rather than raw data. By incorporating JEPA into the ISAC framework, we solve several key issues inherent to traditional SC in multimodal sensing: