Abstract: This article addresses a new task: distributed multimedia sensor event analysis (DiMSEA). DiMSEA aims to analyze a series of human and machine activities (called “events” in this article) in complex and extensive real-world environments. Since the observation from any single sensor is often missing or fragmented in such environments, observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method that extracts joint representations effectively combining such distributed observations has yet to be established. Therefore, we propose guided masked self-distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to compensate for the information missing from a masked sensor by using the information from the other sensors that is needed to detect the event. Guided-MELD is expected to effectively distill fragmented target event information from sensors without over-relying on any specific sensor. To validate the effectiveness of the proposed method in DiMSEA, we recorded two new datasets: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, respectively, recorded using distributed cameras and microphones. Experimental results show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when the number of sensors was reduced.
External IDs: dblp:journals/tomccap/YasudaHOSNO26