Abstract: Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely long inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning, condensing long-context multi-modal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Code is publicly available at https://github.com/Zehong-Ma/EMLoC.
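To make the abstract's core idea concrete, below is a minimal, illustrative sketch of layer-wise token pruning guarded by a Jensen-Shannon divergence budget. It is not the released EMLoC implementation: the names (js_divergence, prune_layer_tokens, predict_fn, ratios, budget) and the assumption that the constraint is checked against the model's output distribution with and without pruning are our own simplifications for exposition; consult the repository linked above for the actual method.

```python
# Hypothetical sketch of JS-divergence-constrained token pruning.
# All function and parameter names here are illustrative, not from the EMLoC codebase.
import torch
import torch.nn.functional as F


def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two probability vectors p and q."""
    p = (p.clamp_min(eps)); p = p / p.sum()
    q = (q.clamp_min(eps)); q = q / q.sum()
    m = 0.5 * (p + q)
    # F.kl_div(log_m, p) computes KL(p || m); JSD is the average of the two KLs.
    return 0.5 * (F.kl_div(m.log(), p, reduction="sum")
                  + F.kl_div(m.log(), q, reduction="sum"))


def prune_layer_tokens(hidden, scores, full_dist, predict_fn,
                       ratios=(0.25, 0.5, 0.75, 1.0), budget=0.05):
    """Keep the smallest fraction of context tokens (ranked by attention
    `scores`) whose pruned output distribution stays within `budget` of the
    full-context distribution `full_dist` under JS divergence."""
    n = hidden.size(0)
    order = scores.argsort(descending=True)           # most-attended tokens first
    for r in ratios:                                   # most aggressive ratio first
        keep = order[: max(1, int(r * n))].sort().values  # restore original order
        pruned_dist = predict_fn(hidden[keep])         # output dist. on pruned memory
        if js_divergence(pruned_dist, full_dist) <= budget:
            return hidden[keep], keep                  # earliest ratio within budget
    return hidden, torch.arange(n)                     # fall back to no pruning
```

In this sketch, a per-layer loop would call prune_layer_tokens on that layer's compressed memory tokens, so the keep ratio adapts independently at each layer while the divergence budget bounds the deviation from the unpruned behavior.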
Lay Summary: Today’s powerful AI models can understand both images and text, but adapting them to new tasks usually requires a lot of extra training, which is time-consuming and expensive. Our work introduces EMLoC, a new method that avoids extra training and instead allows the model to learn new tasks just by reading a few examples, much like how people learn from demonstrations.
However, feeding large amounts of image and text data into these models creates technical challenges—it’s slow and uses a lot of memory. To solve this, we developed a way to compress and simplify the input while keeping the important information. Think of it like summarizing a long story without losing the key points. Our method smartly decides which parts of the input can be skipped or shortened at each stage of the model, keeping the results accurate while making everything run much faster.
This makes it easier to use these AI systems in real-world situations where speed and limited computing resources matter, such as mobile apps or robots. You can find our code and try it out yourself at https://github.com/Zehong-Ma/EMLoC.
Link To Code: https://github.com/Zehong-Ma/EMLoC
Primary Area: Applications->Computer Vision
Keywords: Multi-modal Long Context Learning, Training-free Adaptation, Adaptive Pruning
Submission Number: 7295