Semantic Communication (SC) has emerged as a transformative paradigm for data transmission, particularly in distributed sensing scenarios that demand efficiency and scalability. Unlike traditional communication, which transmits raw data, SC first transforms sensory data into latent embeddings via an encoder. These embeddings are then modulated into Radio Frequency (RF) signals for wireless transmission. At the receiver, the signals are demodulated back into embeddings, which are decoded to reconstruct the original sensory data. This ability to condense and communicate semantic information holds great promise for applications such as autonomous driving and environmental monitoring, where distributed multimodal sensor networks gather vast amounts of sensory data \cite{10333738, 10049005}.
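To make this pipeline concrete, the following minimal sketch (in PyTorch; the module sizes, the simple I/Q pairing used for modulation, and the AWGN channel model are illustrative assumptions rather than any specific SC design) traces a batch of sensor frames through encoding, modulation, a noisy channel, demodulation, and decoding:

\begin{verbatim}
# Minimal SC pipeline sketch: encode -> modulate -> channel
# -> demodulate -> decode. All dimensions are illustrative.
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Maps raw sensory data to a compact latent embedding."""
    def __init__(self, in_dim=784, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, x):
        return self.net(x)

class SemanticDecoder(nn.Module):
    """Reconstructs sensory data from the received embedding."""
    def __init__(self, latent_dim=64, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
    def forward(self, z):
        return self.net(z)

def modulate(z):
    # Pair consecutive embedding values into complex I/Q symbols.
    return torch.complex(z[..., 0::2], z[..., 1::2])

def awgn_channel(s, snr_db=10.0):
    # Add complex white Gaussian noise at the given SNR.
    sig_pow = s.abs().pow(2).mean()
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    noise = torch.sqrt(noise_pow / 2) * torch.complex(
        torch.randn_like(s.real), torch.randn_like(s.real))
    return s + noise

def demodulate(s):
    # Interleave real/imaginary parts back into a real embedding.
    return torch.stack([s.real, s.imag], dim=-1).flatten(start_dim=-2)

encoder, decoder = SemanticEncoder(), SemanticDecoder()
x = torch.randn(8, 784)            # a batch of flattened sensor frames
z_rx = demodulate(awgn_channel(modulate(encoder(x))))
x_hat = decoder(z_rx)              # reconstruction at the receiver
\end{verbatim}

In practice, the encoder and decoder are typically trained jointly over the channel so that the latent embedding preserves task-relevant semantics under noise.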
Despite these advantages, SC systems face a key challenge that limits their effectiveness in distributed sensing applications: existing SC approaches treat the sensing data of each modality independently and thus cannot effectively integrate or fuse semantic information across modalities, forfeiting opportunities to improve sensing accuracy \cite{10233481, 9877924, 10330577}. This issue becomes particularly evident in scenarios such as camera-view blockages, where vital visual information is obscured.
\begin{figure}
    \centering
    \includegraphics[scale=0.53]{figs/workflow.pdf}
    \caption{Comparison between the workflows of traditional SC and JESAC systems.}
    \label{fig:comparison}
\end{figure}
To address this challenge, we propose JESAC, a self-enabled Integrated Sensing and Semantic Communication framework powered by a central multimodal Joint Embedding Predictive Architecture (JEPA). JEPA is a self-supervised learning model that predicts target information in the latent (semantic) space from context inputs, rather than generating data at the pixel or token level \cite{assran2023self}. JESAC leverages existing RF signals as an additional sensing modality to enhance contextual understanding: these signals naturally propagate through the target sensing area during normal communication and thus inherently carry valuable contextual information about objects and their surroundings. By integrating the embeddings of both sensory data and RF signals into a multimodal JEPA, JESAC significantly improves sensing accuracy without additional hardware cost. Moreover, centralized JEPA processing enables efficient cross-modal semantic sharing from the wireless modality to the other modalities without requiring labeled training samples, further streamlining the system.
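To illustrate the core mechanism, the sketch below shows a JEPA-style training objective extended to two context modalities (camera and RF); the encoder and predictor architectures, the fusion-by-concatenation choice, the exponential-moving-average (EMA) target update, and all dimensions are assumptions for exposition, not JESAC's actual design:

\begin{verbatim}
# Hedged sketch of a multimodal JEPA-style objective: the predictor
# maps fused context embeddings (camera + RF) to the latent of a
# held-out target view, so no pixel/token reconstruction and no
# labels are needed. All shapes and choices here are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

cam_encoder = mlp(784, 64)    # context modality 1: camera frame
rf_encoder  = mlp(128, 64)    # context modality 2: RF measurement
target_encoder = copy.deepcopy(cam_encoder)  # EMA copy of encoder
target_encoder.requires_grad_(False)         # never receives gradients
predictor = mlp(64 + 64, 64)  # predicts target latent from fused context

def jepa_loss(cam_ctx, rf_ctx, cam_tgt):
    # Fuse per-modality embeddings by concatenation (an assumption).
    z_ctx = torch.cat([cam_encoder(cam_ctx), rf_encoder(rf_ctx)], dim=-1)
    with torch.no_grad():                     # targets are stop-gradient
        z_tgt = target_encoder(cam_tgt)
    return F.mse_loss(predictor(z_ctx), z_tgt)  # loss lives in latent space

@torch.no_grad()
def ema_update(m=0.996):
    # Slowly track the context encoder with the target encoder.
    for p_t, p_c in zip(target_encoder.parameters(),
                        cam_encoder.parameters()):
        p_t.mul_(m).add_((1 - m) * p_c)

# One illustrative step: a blocked camera view plus an RF measurement
# serve as context; the unblocked view provides the latent target.
cam_ctx, rf_ctx, cam_tgt = (torch.randn(8, 784), torch.randn(8, 128),
                            torch.randn(8, 784))
loss = jepa_loss(cam_ctx, rf_ctx, cam_tgt)
loss.backward(); ema_update()
\end{verbatim}

Because the loss compares predicted and target embeddings directly, no labels or pixel-level reconstruction are required, which is what allows the RF modality to be folded in without labeled training samples.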
Our contributions are summarized as follows: