Efficient Unified Multimodal Understanding and Generation with Gated Hybrid Attention

ICLR 2026 Conference Submission 15287 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Unified Multimodal Learning, Unified Multimodal Generation, Efficient Transformers, Hybrid Attention
Abstract: Unified multimodal learning requires attention mechanisms that are both efficient and expressive. Softmax attention provides strong modeling capacity but suffers from quadratic complexity, while linear attention achieves near-linear efficiency at the cost of weaker expressivity. We identify two major expressivity challenges in efficient unified multimodal models: (\emph{i}) modality imbalance, where dominant signals suppress weaker modalities during fusion, and (\emph{ii}) loss of global context, as efficient variants tend to over-smooth long sequences. We propose \textbf{Gated Hybrid Attention (GHA)}, a multimodal-specialized operator that augments linear attention with (\emph{i}) a selective gating mechanism to balance modality contributions and stabilize training, and (\emph{ii}) agent-token softmax aggregation to restore adaptive global context while preserving near-linear complexity. To demonstrate generality, we validate GHA in two representative paradigms: autoregressive-only(AR-only) and autoregressive+diffusion(AR+Diffusion). In both settings, GHA consistently improves multimodal alignment, long-context retention, and efficiency over comparable Transformer and efficient attention baselines. These cross-paradigm results highlight that GHA functions as a plug-and-play building block, offering a lightweight and extensible approach that is orthogonal to scaling trends and modality complexity.
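Since the abstract describes the operator's ingredients but not its equations, the following is a minimal, hypothetical PyTorch sketch of how a GHA-style block combining them might look: a kernelized (elu + 1) linear-attention branch with O(N) cost, a small bank of learned agent tokens that performs softmax aggregation at O(N·M) cost, and a per-token sigmoid gate that mixes the two branches. The class name GatedHybridAttention, the feature map, the gate form, and num_agents are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedHybridAttention(nn.Module):
    """Hypothetical sketch of a GHA-style block: a linear-attention branch
    plus an agent-token softmax branch, mixed by a learned selective gate."""

    def __init__(self, dim: int, num_heads: int = 8, num_agents: int = 64):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)  # selective gate (assumed sigmoid form)
        self.agents = nn.Parameter(torch.randn(num_agents, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def _split(self, x: torch.Tensor) -> torch.Tensor:
        # (B, N, D) -> (B, h, N, dh)
        B, N, _ = x.shape
        return x.view(B, N, self.h, self.dh).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = map(self._split, (q, k, v))

        # Linear-attention branch: (elu + 1) feature map, O(N) in sequence length.
        qf, kf = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", kf, v)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", qf, kf.sum(dim=2)) + 1e-6)
        lin = torch.einsum("bhnd,bhde,bhn->bhne", qf, kv, z)

        # Agent-token softmax branch: agents summarize the sequence with softmax
        # attention, then tokens read the summary back; cost O(N * num_agents).
        a = self._split(self.agents.unsqueeze(0).expand(B, -1, -1))
        summary = F.scaled_dot_product_attention(a, k, v)     # agents attend to tokens
        glob = F.scaled_dot_product_attention(q, a, summary)  # tokens attend to agents

        # Selective gate mixes the local (linear) and global (agent) branches.
        g = torch.sigmoid(self.gate(x)).view(B, N, self.h, self.dh).transpose(1, 2)
        out = (g * lin + (1 - g) * glob).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

Under these assumptions the block stays near-linear in sequence length (the only softmax attention runs against the fixed agent bank), while the gate gives each token an adaptive trade-off between the smooth linear branch and the agent-mediated global context, which is the balance the abstract attributes to GHA.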
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15287