Keywords: Modality Fatigue; Activation Decay Detection; Dynamic Switching Control; Self-Healing Framework; Cross-Modal Processing Degradation
TL;DR: We discover and address 'modality fatigue' in deployed multimodal models through a lightweight plug-and-play framework that monitors cross-modal activation decay and applies dynamic compensation to maintain processing balance over time.
Abstract: In long-context multimodal reasoning, models often begin to "burn out", not because of architectural flaws, but because one or more input modalities gradually lose their expressiveness. *Is this merely an attention failure? Or is the modality itself fatigued?* We propose a new perspective on this degradation: **Modality Fatigue**, a phenomenon in which the model's activation and responsiveness to certain modalities decay over time, manifesting as attention attenuation, fusion drift, semantic shift, and loss of task sensitivity. Unlike prior approaches that focus on modeling inter-modal attention patterns or modality graphs, we shift the lens to the evolving internal state of each modality. We conceptualize modality fatigue as a dynamic decline in each modality's "vital sign," modeled through its activation signal trajectory. Concretely, we introduce the **Modality Activation Decay Detector (MAD)**, which monitors each modality's instantaneous activation $\alpha_m(t)$ and its change rate $\delta_m(t)$ while dynamically computing a fatigue-triggering threshold $\tau_m(t)$ from historical trends. Once fatigue is detected, the **Modality Alternation & Compensation Controller (MAC)** adaptively adjusts the fusion path and applies recall compensation: it governs the integration of current perception and retrieved memory via a learnable gate $\lambda_m(t)$, thereby restoring under-utilized modality signals. Our method sidesteps the need for full attention matrices or inter-modal graph modeling; instead, it decomposes modality state tracking into independent one-dimensional activation curves, enabling lightweight monitoring and fine-grained control with high interpretability. Across various long-context benchmarks, our framework shows encouraging results in preserving modality balance, enhancing fusion robustness, and mitigating information drift and omission. By uncovering and addressing modality fatigue through transparent, signal-based modeling, we take a step toward building multimodal systems that can perceive their own internal states and adapt accordingly.
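A minimal sketch of how the MAD/MAC loop described above could be realized. The concrete choices here (mean absolute feature value as $\alpha_m(t)$, a rolling mean-minus-$k\cdot$std threshold for $\tau_m(t)$, and a sigmoid gate for $\lambda_m(t)$) are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of the MAD/MAC idea from the abstract; all concrete
# design choices below are assumptions, not the authors' implementation.
from collections import deque
import torch
import torch.nn as nn


class ModalityActivationDecayDetector:
    """Tracks alpha_m(t), delta_m(t), and a history-based threshold tau_m(t)."""

    def __init__(self, window: int = 32, k: float = 1.5):
        self.history = deque(maxlen=window)  # recent alpha_m values
        self.k = k                           # sensitivity of the threshold
        self.prev_alpha = None

    def step(self, feats_m: torch.Tensor) -> tuple[float, float, float, bool]:
        # alpha_m(t): instantaneous activation of modality m (assumed here to
        # be the mean absolute feature value of its tokens).
        alpha = feats_m.abs().mean().item()
        # delta_m(t): change rate relative to the previous step.
        delta = 0.0 if self.prev_alpha is None else alpha - self.prev_alpha
        self.prev_alpha = alpha

        if len(self.history) > 1:
            hist = torch.tensor(list(self.history))
            # tau_m(t): fatigue threshold derived from the historical trend
            # (assumed form: rolling mean minus k standard deviations).
            tau = (hist.mean() - self.k * hist.std()).item()
        else:
            tau = float("-inf")  # not enough history to judge fatigue yet
        self.history.append(alpha)

        # Flag fatigue when activation has dropped below the historical
        # threshold and is still declining.
        fatigued = alpha < tau and delta < 0.0
        return alpha, delta, tau, fatigued


class ModalityCompensationController(nn.Module):
    """Learnable gate lambda_m(t) mixing current perception with recalled memory."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # produces lambda_m(t) in (0, 1)

    def forward(self, current: torch.Tensor, memory: torch.Tensor,
                fatigued: bool) -> torch.Tensor:
        if not fatigued:
            return current  # no compensation needed for a healthy modality
        lam = torch.sigmoid(self.gate(torch.cat([current, memory], dim=-1)))
        # Blend retrieved memory back in to restore the under-utilized signal.
        return lam * memory + (1.0 - lam) * current
```

In use, one detector instance would track each modality independently at every step, and the controller would be applied only to the modalities flagged as fatigued, keeping the monitoring overhead to a set of one-dimensional activation curves.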
Primary Area: interpretability and explainable AI
Submission Number: 13774