Through Diverse Lenses: Multimodal Collaborative Perception for Indoor Scenes in Smart Home Systems

Published: 2025, Last Modified: 06 Jan 2026, Venue: IEEE Internet Things J. 2025, License: CC BY-SA 4.0
Abstract: The confluence of Internet-of-Things (IoT) and artificial intelligence has advanced smart home (SH) systems, enabling the provision of complex scene-aware services. Central to these services is the precise perception of the indoor environment. Indoor scenes present unique challenges due to diverse layouts, frequent object occlusions, and dynamic human activities, which prevent any individual SH device or sensor from understanding the scene comprehensively. Moreover, the diversity of sensors on SH devices introduces a multimodal data issue, necessitating the reconciliation of discrepancies among data modalities. This article presents multimodal collaborative perception (MMCP), a collaborative perception paradigm for SH systems with multimodal raw data. MMCP leverages the intermediate collaboration framework and tailors it to an edge-assisted SH system. It deploys dedicated encoders at SH devices to convert multimodal raw data to uniform intermediate features, which are then sent to an edge computing box for aggregation and perception. MMCP introduces a critical information identifier to selectively transmit the informative parts of intermediate features, thereby reducing the communication overhead for bandwidth-constrained SH devices. Moreover, MMCP designs collaborative infomax (CIM) to facilitate intermediate feature aggregation. CIM defines multiview mutual information (MVMI) to capture dependencies between the aggregated feature and the individual intermediate features from multiple SH devices. It employs contrastive learning to estimate and maximize MVMI in an unsupervised manner, such that the aggregated feature retains discriminative information from the individual intermediate features. We evaluate MMCP on four real-world indoor scene datasets. Experimental results show that MMCP outperforms a noncollaborative strategy by 18% in average precision (AP). In particular, MMCP strikes a favorable balance between perception performance and communication overhead, compressing intermediate features to a ratio of 13% while maintaining higher AP than state-of-the-art methods.
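The abstract does not give the exact form of MVMI or of the contrastive estimator, so the following is only a minimal sketch of the general idea, assuming an InfoNCE-style lower bound averaged over devices. The function names (`info_nce_bound`, `multiview_bound`), the cosine-similarity critic, the `temperature` value, and the plain averaging across views are all illustrative assumptions, not the paper's method.

```python
import numpy as np


def info_nce_bound(aggregated, view, temperature=0.1):
    """InfoNCE lower bound on the mutual information between the aggregated
    feature and one device's intermediate feature (hypothetical helper).

    aggregated, view: arrays of shape (N, D); row i of each forms a positive
    pair, and all other rows in the batch act as negatives.
    """
    # Cosine-similarity critic (an assumption; any learned critic could be used).
    a = aggregated / np.linalg.norm(aggregated, axis=1, keepdims=True)
    v = view / np.linalg.norm(view, axis=1, keepdims=True)
    logits = a @ v.T / temperature                       # shape (N, N)

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # InfoNCE: log N plus the mean log-probability assigned to the positives.
    return float(np.log(len(a)) + np.mean(np.diag(log_probs)))


def multiview_bound(aggregated, views, temperature=0.1):
    """Average the per-device bounds (assumed stand-in for MVMI)."""
    return float(np.mean([info_nce_bound(aggregated, v, temperature) for v in views]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = [rng.normal(size=(32, 16)) for _ in range(3)]  # three simulated devices
    aggregated = np.mean(views, axis=0)                    # naive aggregation
    print(multiview_bound(aggregated, views))
```

In a training loop, the negative of this bound would serve as an unsupervised loss for the aggregation module, so that maximizing it pushes the aggregated feature to stay predictive of each device's feature. The bound can never exceed log N for a batch of N samples, which is a known limitation of InfoNCE-style estimators.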