Wireless communication systems can benefit from peripheral data sources beyond the radio frequency (RF) signal domain, such as location, motion sensor data, and camera images [1]–[5]. Incorporating these non-RF modalities can complement features that are missing from RF signals, enabling more accurate handover decisions [3], received power predictions [5], and so on. Motivated by this, this letter focuses on millimeter-wave (mmWave) uplink received power prediction that efficiently integrates received mmWave RF signal powers and depth camera images.

As shown in prior work [5], depth image-based prediction exploiting machine learning (ML) achieves better accuracy by recognizing mobility blockage patterns, which allows it to detect sudden transitions between line-of-sight (LoS) and non-LoS (NLoS) conditions that are hardly observable from received mmWave signal powers. By contrast, current received mmWave signal powers are useful for predicting short-term received power fluctuations under a given LoS or NLoS condition [6]. To realize the full potential of both modalities, our goal is to fuse RF received powers and depth images in ML-based received power prediction.

There are two key challenges in acquiring depth images: communication latency and privacy violation. The first challenge arises because depth images are not necessarily obtained at the same location as the RF received powers. This physical separation necessitates communication between the entity holding the images (e.g., user equipment (UE) or surveillance cameras) and the entity holding the RF received powers (e.g., base stations (BSs)) over limited wireless bandwidth, which can cause severe latency in collecting depth images. However, numerous mmWave applications are delay-sensitive (e.g., virtual reality [7]). Hence, it is important to design a prediction framework that acquires depth images with low communication latency. The second challenge arises because depth images may contain privacy-sensitive information, e.g., the travel history of people in the view of cameras. Acquiring raw depth images can therefore violate the privacy of the pedestrians who block mmWave links, which motivates us to design a framework that performs received power prediction in a privacy-preserving manner.

To address the aforementioned challenges, we propose a communication-efficient and privacy-preserving multimodal split learning (MultSL) framework. Exploiting a split neural network (NN) architecture [8], MultSL combines the RF and image modalities by exchanging only NN activations and gradients, without sharing raw data (Fig. 1(c)). Before the NN activations are exchanged, the last activations for the image modality are compressed (see Fig. 2), which improves communication efficiency while better preserving data privacy. Surprisingly, experimental evaluations show that the compression also helps balance the fusion between the RF and image modalities. Consequently, MultSL with an optimal compression rate achieves higher accuracy not only than baseline schemes based solely on either received mmWave powers (RF, Fig. 1(a)) or images (Img, Fig. 1(b)), but also than MultSL without compression.
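As a rough illustration of the exchange described above, the following NumPy sketch walks through one forward and backward pass across the split. It is not the implementation used in this letter: the use of average pooling as the compression step, the array sizes, the single linear output layer, and all names (`avg_pool2d`, `image_activations`, `rf_powers`, `w_out`) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pool2d(x, k):
    """Average-pool a (batch, H, W) array over non-overlapping k x k windows."""
    b, h, w = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    return x.reshape(b, h // k, k, w // k, k).mean(axis=(2, 4))

# --- Camera side (holds raw depth images; they never leave this side) ---
batch, height, width = 4, 8, 8
# Stand-in for the last activations of the image-side sub-network (assumed shape).
image_activations = rng.standard_normal((batch, height, width))
k = 4  # hypothetical pooling size; larger k means stronger compression
compressed = avg_pool2d(image_activations, k)  # shape (4, 2, 2)

# --- BS side (holds RF received powers) ---
rf_powers = rng.standard_normal((batch, 5))  # stand-in for past received powers
fused = np.concatenate([rf_powers, compressed.reshape(batch, -1)], axis=1)
w_out = 0.1 * rng.standard_normal((fused.shape[1], 1))  # toy linear output layer
prediction = fused @ w_out  # predicted received power, shape (4, 1)

# --- Backward pass: only gradients w.r.t. the compressed activations go back ---
target = rng.standard_normal((batch, 1))
grad_prediction = 2.0 * (prediction - target) / batch  # d(MSE)/d(prediction)
grad_fused = grad_prediction @ w_out.T
grad_compressed = grad_fused[:, rf_powers.shape[1]:].reshape(compressed.shape)

# Payload reduction per exchange relative to sending uncompressed activations.
payload_ratio = image_activations.size // compressed.size
```

In this toy setting, only `compressed` crosses the link in the forward direction and only `grad_compressed` in the backward direction, so each exchange carries `k * k` times fewer values than the uncompressed activations would.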

Related Works. For handover or positioning, RF-based received power or channel state information is utilized [9], [10]. For mmWave received power prediction, the prior study in [5] utilizes camera images. While the aforementioned works consider a single modality, the proposed MultSL utilizes both image and RF modalities for mmWave received power prediction, thereby achieving higher accuracy. Moreover, while the study in [5] does not take into account communication efficiency and privacy in gathering images, MultSL integrates image and RF modalities in a communication-efficient and privacy-preserving manner by leveraging a novel split learning (SL) framework. The original SL framework in [8] combines NN activations and gradients that are generated from a single modality without exchanging raw data. In [11], to improve privacy guarantees in SL, the split NN is optimized to maximize the Kullback-Leibler (KL) divergence between the distributions of raw health data and NN activations. These SL works focus on a single modality and do not consider communication efficiency. In contrast, MultSL integrates NN activations originating from two different modalities and optimizes the split NN by compressing the last activations of depth images, thereby improving communication efficiency and privacy guarantees.