This paper presents JESAC, a JEPA-aided, ISAC-enabled framework designed to enhance the performance of Semantic Communication (SC) systems through cross-modal information sharing and RF-based sensing. Traditional SC systems often suffer reconstruction errors due to limited contextual understanding and their restriction to a single modality. JESAC addresses these issues by integrating the Joint Embedding Predictive Architecture (JEPA) with Integrated Sensing and Communication (ISAC). JEPA plays a central role in compressing raw sensor data into semantic embeddings, enabling cross-modal information sharing. Meanwhile, ISAC exploits Radio Frequency (RF) data that are naturally available in a communication system, such as Channel State Information (CSI), as an additional sensing modality. In JESAC, the RF data are encoded and aggregated with the extracted semantic embeddings, enriching the context for more accurate reconstruction of the raw data. For performance evaluation, we collected multi-modal datasets combining camera images and CSI data using off-the-shelf cameras and Wi-Fi devices. Image reconstruction experiments on these datasets show that integrating CSI data improves image quality, outperforming single-modal I-JEPA baselines. Furthermore, JESAC achieves this reconstruction gain with only a minimal increase in model complexity, effectively balancing computational cost and performance.
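To make the aggregation step concrete, the sketch below illustrates one plausible realization of the pipeline described above: a JEPA-style image encoder produces a semantic embedding, a separate encoder maps a CSI measurement to an RF embedding, and the two are fused before image reconstruction. All module shapes, names, and the concatenation-based fusion are illustrative assumptions; the paper's actual encoders, CSI preprocessing, and aggregation scheme are not specified here and may differ.

```python
# Minimal sketch of a JESAC-style cross-modal aggregation (assumed, not the
# authors' implementation): image embedding + CSI embedding -> fused vector
# -> image reconstruction.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Stand-in for a JEPA-style image encoder producing semantic embeddings."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, img):
        return self.backbone(img)

class CSIEncoder(nn.Module):
    """Encodes a flattened CSI measurement (e.g., per-subcarrier amplitudes)."""
    def __init__(self, csi_dim=128, embed_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(csi_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )

    def forward(self, csi):
        return self.mlp(csi)

class Decoder(nn.Module):
    """Reconstructs the image from the fused embedding."""
    def __init__(self, in_dim=320):
        super().__init__()
        self.fc = nn.Linear(in_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 64, 8, 8)
        return self.deconv(h)

class JESACSketch(nn.Module):
    """Fuses image and CSI embeddings by concatenation before decoding."""
    def __init__(self):
        super().__init__()
        self.img_enc = ImageEncoder(embed_dim=256)
        self.csi_enc = CSIEncoder(csi_dim=128, embed_dim=64)
        self.decoder = Decoder(in_dim=256 + 64)

    def forward(self, img, csi):
        z = torch.cat([self.img_enc(img), self.csi_enc(csi)], dim=-1)
        return self.decoder(z)

# Toy forward/backward pass with random tensors standing in for the
# camera + Wi-Fi CSI dataset described in the paper.
model = JESACSketch()
img = torch.rand(4, 3, 64, 64)   # batch of RGB frames
csi = torch.randn(4, 128)        # batch of flattened CSI vectors
recon = model(img, csi)
loss = nn.functional.mse_loss(recon, img)
loss.backward()
print(recon.shape)  # torch.Size([4, 3, 64, 64])
```

Concatenation is used here only because it is the simplest fusion that keeps the added parameter count small, consistent with the abstract's claim of minimal complexity overhead; attention-based or gated fusion would be drop-in alternatives.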