Dynamic Context Adapters: Efficiently Infusing History into Vision-and-Language Models

18 Sept 2025 (modified: 18 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Efficient Deep Learning Methods, Lightweight Memory
Abstract: Transformer-based Vision-and-Language Models (VLMs) have set new benchmarks across diverse multimodal tasks by effectively aligning visual and linguistic inputs. Despite their remarkable success, existing VLMs process each visual input independently, which limits downstream tasks that require integrating sequential historical context. Naively incorporating historical frames directly into Transformer inputs results in quadratic self-attention complexity and excessive memory usage. Prior token-concatenation methods severely inflate computational costs, while recurrent-based methods compress history at the cost of fine-grained temporal detail, leading to context degradation. Inspired by recent advances in parameter-efficient fine-tuning (PEFT) techniques, we propose a novel approach to efficiently inject additional context into pre-trained VLMs. Instead of directly concatenating history frames, our method maintains a fixed-size, dynamically compressed memory of historical semantics. We demonstrate that our approach significantly reduces computational overhead, preserves fine-grained temporal fidelity of historical context, and shows strong adaptability even with a smaller backbone.
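To make the idea concrete, the following is a minimal PyTorch sketch of a fixed-size memory adapter of the kind the abstract describes: a small set of learnable memory slots is updated from each incoming frame's tokens (compression) and then injected into the frozen VLM's hidden states through a lightweight cross-attention path. All module names, the slot count, and the zero-initialized gate are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class DynamicContextAdapter(nn.Module):
    """Sketch of a fixed-size, dynamically updated memory adapter.

    The memory is compressed from incoming frame tokens ("write") and
    injected into a frozen VLM layer's hidden states ("read"). Sizes and
    names are hypothetical, chosen only for illustration.
    """

    def __init__(self, d_model: int = 768, num_slots: int = 16, num_heads: int = 8):
        super().__init__()
        # Fixed-size memory: num_slots learnable slots of width d_model.
        self.memory_init = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        # Compression step: memory slots attend to the new frame's tokens.
        self.write_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.write_norm = nn.LayerNorm(d_model)
        # Injection step: VLM hidden states attend to the memory.
        self.read_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.read_norm = nn.LayerNorm(d_model)
        # Zero-initialized gate so the frozen VLM is unchanged at initialization.
        self.gate = nn.Parameter(torch.zeros(1))

    def init_memory(self, batch_size: int) -> torch.Tensor:
        # Replicate the learnable slots for each sample in the batch.
        return self.memory_init.unsqueeze(0).expand(batch_size, -1, -1)

    def write(self, memory: torch.Tensor, frame_tokens: torch.Tensor) -> torch.Tensor:
        """Compress a new frame's tokens into the fixed-size memory."""
        update, _ = self.write_attn(query=memory, key=frame_tokens, value=frame_tokens)
        return self.write_norm(memory + update)

    def read(self, hidden_states: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        """Inject compressed historical context into the VLM's hidden states."""
        context, _ = self.read_attn(query=hidden_states, key=memory, value=memory)
        return hidden_states + self.gate * self.read_norm(context)


if __name__ == "__main__":
    adapter = DynamicContextAdapter()
    memory = adapter.init_memory(batch_size=2)           # (2, 16, 768)
    for _ in range(5):                                    # five history frames
        frame_tokens = torch.randn(2, 196, 768)           # e.g. ViT patch tokens
        memory = adapter.write(memory, frame_tokens)      # memory stays fixed-size
    hidden = torch.randn(2, 64, 768)                      # current VLM hidden states
    out = adapter.read(hidden, memory)
    print(out.shape)                                      # torch.Size([2, 64, 768])
```

Because the memory has a constant number of slots, per-step cost stays linear in the current input length rather than growing with the number of history frames, which is the efficiency argument the abstract makes against direct token concatenation.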
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 10854