Keywords: MLLM, training-free adaptation
Abstract: Multimodal large language models (MLLMs) suffer from a coordination failure during training: attention heads are optimized independently despite sharing inputs, and many consequently develop suboptimal specialization patterns.
We find that many attention heads exhibit high downstream influence yet minimal cross-modal interaction, acting as performance bottlenecks that propagate misaligned patterns throughout the network.
To address this, we introduce \textbf{RAH-LoRA (Representative Anchor Head Low-Rank Adaptation)}, a training-free calibration method that realigns these problematic heads by transferring successful patterns from high-performing anchors.
Our key insight is that the transformer's residual architecture enables safe pattern transfer between heads operating in the same representation space.
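One way such an output-stability argument could be sketched (the notation here is ours, not drawn from the paper): since head contributions enter the residual stream additively, the layer output can be written as \(y(x) = x + \sum_h W_O^{h} a^{h}(x)\), so a scaled low-rank update \(W_O^{h} \leftarrow W_O^{h} + \alpha \Delta W^{h}\) to a single head perturbs the output by at most
\[
\|y'(x) - y(x)\| = \alpha \,\|\Delta W^{h} a^{h}(x)\| \;\le\; \alpha\, \|\Delta W^{h}\|_{2}\, \|a^{h}(x)\|_{2},
\]
meaning the step size \(\alpha\) and the spectral norm of the rank-limited update jointly bound how far calibration can move the representation.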
RAH-LoRA identifies bottleneck heads using our proposed metrics (Instruction-conditioned Saliency and Causal Attention Flow), constructs representative patterns from similar well-performing heads, and applies controlled low-rank updates with theoretical guarantees on output stability.
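As an illustration of this pipeline, the following is a minimal sketch, assuming per-head scalar scores for downstream influence and cross-modal saliency, plus a head-to-head similarity measure, have already been computed from forward passes; all function and variable names are hypothetical placeholders, not the authors' implementation.
\begin{verbatim}
# Hypothetical sketch of the calibration step; names are illustrative only.
import numpy as np

def rank_r_truncate(delta, r):
    """Keep only the top-r singular directions of an update matrix."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

def calibrate_heads(head_weights, influence, saliency, sim,
                    r=4, alpha=0.1, top_k=8, n_anchors=4):
    """head_weights: dict head_id -> output-projection matrix.
    influence, saliency: dict head_id -> scalar scores from forward passes.
    sim: dict (head_i, head_j) -> similarity between the two heads."""
    # 1. Bottlenecks: high downstream influence, low cross-modal saliency.
    gap = {h: influence[h] - saliency[h] for h in head_weights}
    bottlenecks = sorted(gap, key=gap.get, reverse=True)[:top_k]
    anchors_pool = [h for h in head_weights if h not in bottlenecks]

    updated = dict(head_weights)
    for h in bottlenecks:
        # 2. Anchors: the most similar well-performing heads.
        anchors = sorted(anchors_pool, key=lambda a: sim[(h, a)],
                         reverse=True)[:n_anchors]
        w = np.array([max(sim[(h, a)], 0.0) for a in anchors])
        w = w / (w.sum() + 1e-8)
        # 3. Representative pattern: similarity-weighted anchor average.
        representative = sum(wi * head_weights[a] for wi, a in zip(w, anchors))
        # 4. Controlled low-rank step toward the representative pattern.
        delta = rank_r_truncate(representative - head_weights[h], r)
        updated[h] = head_weights[h] + alpha * delta
    return updated
\end{verbatim}
Under these assumptions, \(\alpha\) and the rank \(r\) would be chosen so that the resulting update satisfies an output-stability bound of the form sketched earlier.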
The method requires only forward passes on unlabeled data, completing calibration in minutes on a single GPU.
Experiments demonstrate consistent improvements across vision-language benchmarks, with gains strongly correlated with the identified influence-saliency gap, validating that targeting heads with high influence but low cross-modal interaction yields amplified benefits.
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24886