Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Masahiro Yasuda, Noboru Harada, Yasunori Ohishi, Shoichiro Saito, Akira Nakayama, Nobutaka Ono

Published: 2026, Last Modified: 07 May 2026. ACM Trans. Multim. Comput. Commun. Appl. 2026. License: CC BY-SA 4.0

Abstract: This article addresses a new task: distributed multimedia sensor event analysis (DiMSEA). DiMSEA aims to analyze a series of human and machine activities (called “events” in this article) in complex and extensive real-world environments. Since an observation from a single sensor is often missing or fragmented in such an environment, observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method has yet to be established to extract joint representations that effectively combine such distributed observations. Therefore, we propose guided masked self-distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with information from other sensors needed to detect the event. Guided-MELD is expected to effectively distill fragmented target event information from sensors without over-relying on any specific sensors. To validate the effectiveness of the proposed method in DiMSEA, we recorded two new datasets: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when the number of sensors was reduced.
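
The masking idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the architecture, the loss, or how the "guided" component works, so everything below is an assumption made for illustration. In particular, the mean-pooling fusion, the mean-squared-error objective, the exponential-moving-average teacher, and all names (`fuse`, `masked_distillation_loss`, `ema_update`) are hypothetical, and the guidance toward event-relevant information is left out entirely.

```python
import numpy as np


def fuse(features, weights, keep_mask):
    """Project each sensor's features and average over the visible sensors.

    Assumption: a single shared linear projection with tanh and mean pooling
    stands in for whatever fusion network the real method uses.
    """
    projected = np.tanh(features @ weights)        # (num_sensors, dim_out)
    keep = keep_mask.astype(float)[:, None]        # (num_sensors, 1)
    return (projected * keep).sum(axis=0) / keep.sum()


def masked_distillation_loss(features, student_w, teacher_w, masked_sensor):
    """MSE between the student's view (one sensor hidden) and the teacher's full view."""
    num_sensors = features.shape[0]
    all_sensors = np.ones(num_sensors, dtype=bool)
    visible = all_sensors.copy()
    visible[masked_sensor] = False                 # hide one sensor from the student
    target = fuse(features, teacher_w, all_sensors)
    prediction = fuse(features, student_w, visible)
    return float(np.mean((prediction - target) ** 2))


def ema_update(teacher_w, student_w, decay=0.99):
    """Move the teacher weights slowly toward the student weights (assumed EMA teacher)."""
    return decay * teacher_w + (1.0 - decay) * student_w


# Hypothetical setup: 4 sensors, each already encoded as an 8-dimensional feature vector.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))
student_w = rng.normal(size=(8, 16)) * 0.1
teacher_w = student_w.copy()

loss = masked_distillation_loss(features, student_w, teacher_w, masked_sensor=2)
teacher_w = ema_update(teacher_w, student_w)
```

In this toy setting the loss is zero only when the remaining sensors can fully reproduce the representation built from all sensors, which is the intuition behind learning to supplement a masked sensor from the others. A real training loop would backpropagate this loss through the student and sample a different masked sensor at each step.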

External IDs: dblp:journals/tomccap/YasudaHOSNO26