Semantic Communication (SC) has emerged as a transformative paradigm for data transmission, particularly in distributed sensing scenarios that demand efficiency and scalability. Unlike traditional communication, which transmits raw data, SC first transforms sensory data into latent embeddings via an encoder. These embeddings are then modulated into Radio Frequency (RF) signals for wireless transmission. At the receiver, the signals are demodulated back into embeddings, which are decoded to reconstruct the original sensory data. This ability to condense and communicate semantic information holds great promise for applications such as autonomous driving and environmental monitoring, where distributed multimodal sensor networks gather vast amounts of sensory data \cite{10333738, 10049005}.
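To make this pipeline concrete, the following minimal sketch (in PyTorch; the module sizes, the simple I/Q pairing used for modulation, and the AWGN channel model are illustrative assumptions rather than any specific SC design) traces a batch of sensor frames through encoding, modulation, a noisy channel, demodulation, and decoding:

\begin{verbatim}
# Minimal SC pipeline sketch: encode -> modulate -> channel
# -> demodulate -> decode. All dimensions are illustrative.
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Maps raw sensory data to a compact latent embedding."""
    def __init__(self, in_dim=784, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, x):
        return self.net(x)

class SemanticDecoder(nn.Module):
    """Reconstructs sensory data from the received embedding."""
    def __init__(self, latent_dim=64, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
    def forward(self, z):
        return self.net(z)

def modulate(z):
    # Pair consecutive embedding values into complex I/Q symbols.
    return torch.complex(z[..., 0::2], z[..., 1::2])

def awgn_channel(s, snr_db=10.0):
    # Add complex white Gaussian noise at the given SNR.
    sig_pow = s.abs().pow(2).mean()
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    noise = torch.sqrt(noise_pow / 2) * torch.complex(
        torch.randn_like(s.real), torch.randn_like(s.real))
    return s + noise

def demodulate(s):
    # Interleave real/imaginary parts back into a real embedding.
    return torch.stack([s.real, s.imag], dim=-1).flatten(start_dim=-2)

encoder, decoder = SemanticEncoder(), SemanticDecoder()
x = torch.randn(8, 784)            # a batch of flattened sensor frames
z_rx = demodulate(awgn_channel(modulate(encoder(x))))
x_hat = decoder(z_rx)              # reconstruction at the receiver
\end{verbatim}

In practice, the encoder and decoder are typically trained jointly over the channel so that the latent embedding preserves task-relevant semantics under noise.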
Despite these advantages, SC systems face a key challenge that limits their effectiveness in distributed sensing applications: existing SC approaches treat the sensing data of each modality independently and thus cannot effectively integrate or fuse semantic information across modalities, forfeiting opportunities to improve sensing accuracy \cite{10233481, 9877924, 10330577}. This issue becomes particularly evident in scenarios such as camera-view blockages, where vital visual information is obscured.
\begin{figure}
    \centering
    \includegraphics[scale=0.53]{figs/workflow.pdf}
    \caption{Comparison between the workflows of traditional SC and JESAC systems.}
    \label{fig:comparison}
\end{figure}
To address this challenge, we propose JESAC, a self-enabled Integrated Sensing and Semantic Communication framework powered by a central multimodal Joint Embedding Predictive Architecture (JEPA). JEPA is a self-supervised learning model that predicts target information in the latent (semantic) space from context inputs, rather than generating data at the pixel or token level \cite{assran2023self}. JESAC leverages existing RF signals as an additional sensing modality to enhance contextual understanding: these signals naturally propagate through the target sensing area during normal communication and thus inherently carry valuable contextual information about objects and their surroundings. By integrating the embeddings of both sensory data and RF signals into a multimodal JEPA, JESAC significantly improves sensing accuracy without additional hardware cost. Moreover, centralized JEPA processing enables efficient cross-modal semantic sharing from the wireless modality to the other modalities without requiring labeled training samples, further streamlining the system.
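To illustrate the core mechanism, the sketch below shows a JEPA-style training objective extended to two context modalities (camera and RF); the encoder and predictor architectures, the fusion-by-concatenation choice, the exponential-moving-average (EMA) target update, and all dimensions are assumptions for exposition, not JESAC's actual design:

\begin{verbatim}
# Hedged sketch of a multimodal JEPA-style objective: the predictor
# maps fused context embeddings (camera + RF) to the latent of a
# held-out target view, so no pixel/token reconstruction and no
# labels are needed. All shapes and choices here are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

cam_encoder = mlp(784, 64)    # context modality 1: camera frame
rf_encoder  = mlp(128, 64)    # context modality 2: RF measurement
target_encoder = copy.deepcopy(cam_encoder)  # EMA copy of encoder
target_encoder.requires_grad_(False)         # never receives gradients
predictor = mlp(64 + 64, 64)  # predicts target latent from fused context

def jepa_loss(cam_ctx, rf_ctx, cam_tgt):
    # Fuse per-modality embeddings by concatenation (an assumption).
    z_ctx = torch.cat([cam_encoder(cam_ctx), rf_encoder(rf_ctx)], dim=-1)
    with torch.no_grad():                     # targets are stop-gradient
        z_tgt = target_encoder(cam_tgt)
    return F.mse_loss(predictor(z_ctx), z_tgt)  # loss lives in latent space

@torch.no_grad()
def ema_update(m=0.996):
    # Slowly track the context encoder with the target encoder.
    for p_t, p_c in zip(target_encoder.parameters(),
                        cam_encoder.parameters()):
        p_t.mul_(m).add_((1 - m) * p_c)

# One illustrative step: a blocked camera view plus an RF measurement
# serve as context; the unblocked view provides the latent target.
cam_ctx, rf_ctx, cam_tgt = (torch.randn(8, 784), torch.randn(8, 128),
                            torch.randn(8, 784))
loss = jepa_loss(cam_ctx, rf_ctx, cam_tgt)
loss.backward(); ema_update()
\end{verbatim}

Because the loss compares predicted and target embeddings directly, no labels or pixel-level reconstruction are required, which is what allows the RF modality to be folded in without labeled training samples.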
Our contributions are summarized as follows: