Keywords: Modality Fatigue; Activation Decay Detection; Dynamic Switching Control; Self-Healing Framework; Cross-Modal Processing Degradation
TL;DR: We discover and address 'modality fatigue' in deployed multimodal models through a lightweight plug-and-play framework that monitors cross-modal activation decay and applies dynamic compensation to maintain processing balance over time.
Abstract: In long-context multimodal reasoning, models often begin to "burn out", not because of architectural flaws, but because one or more input modalities gradually lose their expressiveness. *Is this merely an attention failure? Or is the modality itself fatigued?* We propose a new perspective on this degradation: **Modality Fatigue**, a phenomenon in which the model's activation and responsiveness to certain modalities decay over time, manifesting as attention attenuation, fusion drift, semantic shift, and loss of task sensitivity. Unlike prior approaches that focus on modeling inter-modal attention patterns or modality graphs, we shift the lens to the evolving internal state of each modality. We conceptualize modality fatigue as a dynamic decline in each modality's "vital sign," modeled through its activation signal trajectory. Concretely, we introduce the **Modality Activation Decay Detector (MAD)**, which monitors each modality's instantaneous activation $\alpha_m(t)$ and its change rate $\delta_m(t)$ while dynamically computing a fatigue-triggering threshold $\tau_m(t)$ from historical trends. Once fatigue is detected, the **Modality Alternation & Compensation Controller (MAC)** adaptively adjusts the fusion path and applies recall compensation: it governs the integration of current perception and retrieved memory via a learnable gate $\lambda_m(t)$, thereby restoring under-utilized modality signals. Our method sidesteps the need for full attention matrices or inter-modal graph modeling; instead, it decomposes modality state tracking into independent one-dimensional activation curves, enabling lightweight monitoring and fine-grained control with high interpretability. Across various long-context benchmarks, our framework shows encouraging results in preserving modality balance, enhancing fusion robustness, and mitigating information drift and omission. By uncovering and addressing modality fatigue through transparent, signal-based modeling, we take a step toward building multimodal systems that can perceive their own internal states and adapt accordingly.
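A minimal sketch of how the MAD/MAC loop described above could be realized. The concrete choices here (mean absolute feature value as $\alpha_m(t)$, a rolling mean-minus-$k\cdot$std threshold for $\tau_m(t)$, and a sigmoid gate for $\lambda_m(t)$) are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of the MAD/MAC idea from the abstract; all concrete
# design choices below are assumptions, not the authors' implementation.
from collections import deque
import torch
import torch.nn as nn


class ModalityActivationDecayDetector:
    """Tracks alpha_m(t), delta_m(t), and a history-based threshold tau_m(t)."""

    def __init__(self, window: int = 32, k: float = 1.5):
        self.history = deque(maxlen=window)  # recent alpha_m values
        self.k = k                           # sensitivity of the threshold
        self.prev_alpha = None

    def step(self, feats_m: torch.Tensor) -> tuple[float, float, float, bool]:
        # alpha_m(t): instantaneous activation of modality m (assumed here to
        # be the mean absolute feature value of its tokens).
        alpha = feats_m.abs().mean().item()
        # delta_m(t): change rate relative to the previous step.
        delta = 0.0 if self.prev_alpha is None else alpha - self.prev_alpha
        self.prev_alpha = alpha

        if len(self.history) > 1:
            hist = torch.tensor(list(self.history))
            # tau_m(t): fatigue threshold derived from the historical trend
            # (assumed form: rolling mean minus k standard deviations).
            tau = (hist.mean() - self.k * hist.std()).item()
        else:
            tau = float("-inf")  # not enough history to judge fatigue yet
        self.history.append(alpha)

        # Flag fatigue when activation has dropped below the historical
        # threshold and is still declining.
        fatigued = alpha < tau and delta < 0.0
        return alpha, delta, tau, fatigued


class ModalityCompensationController(nn.Module):
    """Learnable gate lambda_m(t) mixing current perception with recalled memory."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # produces lambda_m(t) in (0, 1)

    def forward(self, current: torch.Tensor, memory: torch.Tensor,
                fatigued: bool) -> torch.Tensor:
        if not fatigued:
            return current  # no compensation needed for a healthy modality
        lam = torch.sigmoid(self.gate(torch.cat([current, memory], dim=-1)))
        # Blend retrieved memory back in to restore the under-utilized signal.
        return lam * memory + (1.0 - lam) * current
```

In use, one detector instance would track each modality independently at every step, and the controller would be applied only to the modalities flagged as fatigued, keeping the monitoring overhead to a set of one-dimensional activation curves.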
Primary Area: interpretability and explainable AI
Submission Number: 13774