There is a technique from Meta called JEPA (joint embedding predictive architecture) which has been making a lot of noise recently. It can be seen as an inpainting-type technique, but in latent space (not pixel space). This blog post by Yann LeCun explains it quickly (the paper is linked there too): https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
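To fix ideas, here is a minimal sketch of the JEPA-style objective as I understand it: a context encoder sees only the visible patches, an EMA target encoder sees everything, and a predictor fills in the latents of the masked patches, with the loss computed in latent space. This is a PyTorch-style illustration under my own assumptions (module names, shapes, EMA rate), not the actual I-JEPA code.

```python
# Illustrative JEPA-style objective: predict the *latent* embeddings of masked
# patches instead of reconstructing pixels. Names, shapes and the EMA rate are
# assumptions for illustration, not the actual I-JEPA implementation.
import copy
import torch
import torch.nn as nn

class JEPASketch(nn.Module):
    def __init__(self, encoder: nn.Module, predictor: nn.Module, ema: float = 0.996):
        super().__init__()
        self.context_encoder = encoder                 # sees only the visible patches
        self.target_encoder = copy.deepcopy(encoder)   # EMA copy, sees all patches
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = predictor                     # context latents -> predicted target latents
        self.ema = ema

    @torch.no_grad()
    def update_target(self):
        # Slow exponential moving average of the context encoder (BYOL/JEPA-style).
        for pt, pc in zip(self.target_encoder.parameters(), self.context_encoder.parameters()):
            pt.mul_(self.ema).add_(pc, alpha=1.0 - self.ema)

    def forward(self, patches: torch.Tensor, visible: torch.Tensor, masked: torch.Tensor):
        # patches: (B, N, D) patch tokens; visible/masked: (B, N) boolean masks.
        ctx = self.context_encoder(patches * visible.unsqueeze(-1))   # encode visible context
        with torch.no_grad():
            tgt = self.target_encoder(patches)                        # full view, no gradients
        pred = self.predictor(ctx)                                    # "inpaint" in latent space
        per_patch = ((pred - tgt) ** 2).mean(dim=-1)                  # (B, N) latent-space error
        return (per_patch * masked).sum() / masked.sum().clamp(min=1) # loss only on masked patches
```

The same masked-prediction recipe should apply to any modality that admits a token/patch representation, which is why it is worth checking beyond images.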
Check its relevance to our distributed multimodal sensing/fusion/communication setting when applied to different modalities (text, image, video, point cloud).
Datasets are also key; without them it is very hard to validate anything.
In short, we need to think of distributed brains that collectively sense, fuse (inside one brain), and communicate their semantic information. It is by gluing this information together that the task is solved (a single node/brain cannot solve it alone).
The above points cover many interesting and new avenues of research. For example:
Hence a protocol/signalling scheme is also needed, and we should explore this (a signalling/protocol to align modalities across distributed nodes is NOVEL).
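To make this concrete enough to discuss, here is a rough sketch of what one signalling message could carry and how a receiver could fuse aligned latents. Every field name, the linear alignment map and the averaging fusion rule are assumptions for illustration only; designing the real protocol is exactly the open problem.

```python
# Rough sketch of one signalling message between distributed nodes, plus a naive
# fusion step. All field names, the linear alignment map and the average-fusion
# rule are placeholders; the actual protocol is what we would need to design.
from dataclasses import dataclass, field
import time
import numpy as np

@dataclass
class SemanticMessage:
    node_id: int                 # which "brain" is speaking
    modality: str                # e.g. "image", "csi", "point_cloud", "text"
    embedding: np.ndarray        # latent vector in the node's local embedding space
    alignment_version: int       # version of the shared alignment map currently agreed on
    timestamp: float = field(default_factory=time.time)

def align_to_shared_space(msg: SemanticMessage, projection: np.ndarray) -> np.ndarray:
    # Project the node-local latent into the shared space; `projection` stands in
    # for whatever alignment the signalling protocol negotiates.
    return projection @ msg.embedding

def fuse(messages: list[SemanticMessage],
         projections: dict[tuple[int, str], np.ndarray]) -> np.ndarray:
    # Glue the aligned latents from several nodes into one fused representation
    # (simple averaging here, purely as a placeholder for the real fusion rule).
    aligned = [align_to_shared_space(m, projections[(m.node_id, m.modality)]) for m in messages]
    return np.mean(aligned, axis=0)
```

The novel piece is precisely how the nodes agree on (and refresh) these alignment maps with as little signalling overhead as possible.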
The other point is aligning modalities, which allows us to cross-generate data (for free) inside one brain. You had this in your slides (but perhaps in the centralized setting where all data is available); our setting is distributed and samples from some modalities are missing --> inpainting ideas are useful.
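Concretely, once modalities share an aligned latent space, a node can "inpaint" the latent of a modality it did not observe from the ones it did. A minimal sketch under assumed names and dimensions:

```python
# Sketch: "inpaint" the latent of a missing modality from the modalities a node
# actually observed. Dimensions, the averaging step and the predictor architecture
# are placeholders chosen only to make the idea concrete.
import torch
import torch.nn as nn

class CrossModalInpainter(nn.Module):
    def __init__(self, modalities: list[str], dim: int = 256):
        super().__init__()
        self.modalities = modalities
        # One small predictor per modality, fed a summary of the available latents.
        self.predictors = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for m in modalities
        })

    def forward(self, latents: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # latents: modality -> (B, dim) embeddings for the modalities observed at this node.
        available = torch.stack(list(latents.values()), dim=0).mean(dim=0)   # (B, dim) summary
        filled = dict(latents)
        for m in self.modalities:
            if m not in filled:                        # modality missing at this node
                filled[m] = self.predictors[m](available)
        return filled

# Example: a node that has camera and CSI latents but no point-cloud sample.
inpainter = CrossModalInpainter(["image", "csi", "point_cloud"])
observed = {"image": torch.randn(4, 256), "csi": torch.randn(4, 256)}
completed = inpainter(observed)        # now also contains an imputed "point_cloud" latent
```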
Just to make sure you understood what we discussed yesterday and what I need from you regarding the distributed multimodal inpainting problem (we may need to find a better wording):
Given distributed cameras with partial FoV and CSI values, how does the performance (task accuracy, plus other metrics such as model size, energy, etc.) improve by: