LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: LV-XAttn is a distributed cross-attention mechanism with minimal communication overhead for multimodal large language models.
Abstract: Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs, the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique to support longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with Llama 3-V, mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 10.62$\times$ end-to-end speedup compared to existing approaches.
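The sketch below illustrates the core idea from the abstract: each GPU keeps its large key-value shard (visual tokens) local and only the small query block, together with its running softmax statistics, is passed around a ring of GPUs. This is a minimal sketch, not the authors' implementation; it assumes PyTorch with `torch.distributed` already initialized over a ring of ranks, the function names (e.g. `ring_query_cross_attention`) are illustrative, and the paper's activation-recomputation technique is omitted.

```python
import torch
import torch.distributed as dist


def _ring_pass(tensor, send_rank, recv_rank):
    """Send `tensor` to the next rank and receive the previous rank's tensor."""
    recv_buf = torch.empty_like(tensor)
    ops = [
        dist.P2POp(dist.isend, tensor.contiguous(), send_rank),
        dist.P2POp(dist.irecv, recv_buf, recv_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf


def ring_query_cross_attention(q, k_local, v_local):
    """q: [B, H, Lq, D] local query block (text tokens).
    k_local, v_local: [B, H, Lkv, D] local key/value shard (visual tokens)."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    send_rank, recv_rank = (rank + 1) % world_size, (rank - 1) % world_size
    scale = q.shape[-1] ** -0.5

    # Running (unnormalized) output, row-wise max, and softmax normalizer travel
    # together with the query block around the ring.
    out = torch.zeros_like(q)
    row_max = q.new_full(q.shape[:-1] + (1,), float("-inf"))
    row_sum = torch.zeros_like(row_max)

    for _ in range(world_size):
        # Partial attention of the visiting query block against the local KV
        # shard, merged with the online-softmax (log-sum-exp) correction.
        scores = torch.einsum("bhqd,bhkd->bhqk", q, k_local) * scale
        new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        probs = torch.exp(scores - new_max)
        corr = torch.exp(row_max - new_max)
        out = out * corr + torch.einsum("bhqk,bhkd->bhqd", probs, v_local)
        row_sum = row_sum * corr + probs.sum(dim=-1, keepdim=True)
        row_max = new_max

        # Only the small query block and its running statistics move between
        # GPUs; the large KV shard never leaves this GPU.
        q = _ring_pass(q, send_rank, recv_rank)
        out = _ring_pass(out, send_rank, recv_rank)
        row_max = _ring_pass(row_max, send_rank, recv_rank)
        row_sum = _ring_pass(row_sum, send_rank, recv_rank)

    # After a full cycle each rank holds its own query block again, now carrying
    # contributions from every KV shard.
    return out / row_sum
```

Because the query block is typically far smaller than the key-value blocks in long-visual-input settings, circulating queries instead of keys and values is what keeps the communication volume low while still computing exact attention.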
Lay Summary: AI models that understand videos (imagine movies!) are extremely large -- often too big to fit on a single computer. To handle them, researchers typically split the workload across multiple computers. However, this can be very slow because the computers need to exchange large amounts of data during processing. A major bottleneck comes from a part of these models called **cross-attention**, which helps the AI connect visual information (like video frames) to language. In our work, we introduce a new way of splitting the cross-attention workload, called **LV-XAttn**, that significantly reduces the amount of data computers need to exchange without changing the model’s output. The key idea is to keep the largest pieces of data local and only exchange the smaller ones, thereby reducing time spent on communication. We also design a memory-efficient technique that allows the model to handle even longer videos. Our approach works seamlessly with several popular AI models and can make them up to 10 times faster. This makes it more practical to train and deploy powerful AI models that can understand long videos.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/uw-mad-dash/LV-XAttn
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Distributed System, Cross-Attention, Long Context, Multimodal Large Language Model
Submission Number: 7477