Keywords: Unified Multimodal Learning, Unified Multimodal Generation, Efficient Transformers, Hybrid Attention
Abstract: Unified multimodal learning requires attention mechanisms that are both efficient and expressive.
Softmax attention provides strong modeling capacity but suffers from quadratic complexity, while
linear attention achieves near-linear efficiency at the cost of weaker expressivity. We identify two
major expressivity challenges in efficient unified multimodal models: (\emph{i}) modality imbalance,
where dominant signals suppress weaker modalities during fusion, and (\emph{ii}) loss of global context,
as efficient variants tend to over-smooth long sequences. We propose \textbf{Gated Hybrid Attention (GHA)},
a multimodal-specialized operator that augments linear attention with (\emph{i}) a selective gating mechanism
to balance modality contributions and stabilize training, and (\emph{ii}) agent-token softmax aggregation
to restore adaptive global context while preserving near-linear complexity. To demonstrate generality, we validate GHA in two representative paradigms: autoregressive-only (AR-only)
and autoregressive + diffusion (AR+Diffusion). In both settings, GHA consistently improves multimodal alignment,
long-context retention, and efficiency over comparable Transformer and efficient attention baselines.
These cross-paradigm results highlight that GHA functions as a plug-and-play building block, offering
a lightweight and extensible approach that is orthogonal to scaling trends and modality complexity.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15287
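The abstract describes GHA only at a high level, so the following is a minimal sketch of what such a block could look like, assuming an (elu + 1) feature map for the linear-attention branch, a small set of learned agent tokens for the softmax branch, and a per-channel sigmoid gate fusing the two. The module name, `num_agents`, the feature map, and the gating granularity are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedHybridAttention(nn.Module):
    """Hypothetical sketch of a GHA block (assumptions noted above).

    Combines a linear-attention branch (O(N * d^2)) with an
    agent-token softmax branch (O(N * m) for m agents), fused by a
    learned per-channel gate. Requires PyTorch >= 2.0 for
    F.scaled_dot_product_attention.
    """

    def __init__(self, dim: int, num_heads: int = 8, num_agents: int = 64):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)  # assumed: per-channel sigmoid gate
        self.agents = nn.Parameter(torch.randn(num_agents, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t, n):  # (B, n, C) -> (B, h, n, d)
            return t.reshape(B, n, self.h, self.d).transpose(1, 2)

        q, k, v = split(q, N), split(k, N), split(v, N)

        # Linear-attention branch: positive feature map, then associativity
        # lets us contract keys with values first, avoiding the N x N matrix.
        phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum('bhnd,bhne->bhde', phi_k, v)              # (B,h,d,d)
        z = 1.0 / (torch.einsum('bhnd,bhd->bhn', phi_q, phi_k.sum(2)) + 1e-6)
        lin = torch.einsum('bhnd,bhde,bhn->bhne', phi_q, kv, z)

        # Agent-token softmax branch: m agents read the full sequence with
        # softmax attention, then every token reads the agents back.
        m = self.agents.shape[0]
        a = split(self.agents.unsqueeze(0).expand(B, -1, -1), m)    # (B,h,m,d)
        agent_ctx = F.scaled_dot_product_attention(a, k, v)         # agents <- tokens
        glob = F.scaled_dot_product_attention(q, a, agent_ctx)      # tokens <- agents

        # Selective gate fuses the two branches per channel and token.
        g = torch.sigmoid(self.gate(x)).reshape(B, N, self.h, self.d).transpose(1, 2)
        out = (g * lin + (1 - g) * glob).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage sketch: output shape matches the input, so the block is drop-in.
x = torch.randn(2, 1024, 256)
gha = GatedHybridAttention(dim=256, num_heads=8, num_agents=64)
print(gha(x).shape)  # torch.Size([2, 1024, 256])
```

With m fixed and m << N, both softmax stages cost O(N * m), so the block stays near-linear in sequence length while the agent tokens reintroduce an adaptive global pathway, consistent with the complexity claim in the abstract.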