Keywords: self-supervised learning, masked modeling, structured noise, modality-aware masking, video representation learning, audio representation learning, multimodal learning, masked autoencoders, frequency-based masking, color noise masking
Abstract: Masked modeling has emerged as a robust self-supervised learning framework. However, most methods rely on random masking, which disregards the structural properties of different data modalities. To naturally align with the spatiotemporal and spectral characteristics of video and audio data, we introduce a structured noise-based masking approach. By filtering white noise into different color noise distributions, we generate structured masks that capture modality-specific patterns without requiring handcrafted heuristics or access to the data. Our approach enhances masked video and audio modeling frameworks without any additional computational cost. Experiments show that structured noise masking consistently outperforms random masking, underscoring the value of modality-aware masking strategies for representation learning.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 3768
Loading