Track: tiny / short paper (up to 3 pages)
Keywords: Associative Memory, Movement Disorders, Context Learning
Abstract: Multimodal foundation models, particularly those instruction fine-tuned for vision-language tasks, have recently gained prominence for their ability to parse and analyze complex video streams. Despite their effectiveness for broad, general-purpose queries, these models often struggle with domain-specific questions that demand deeper contextual understanding. The core limitation lies in their reliance on vision-language grounding extracted from raw video frames, which does not adequately capture nuanced context when the task is more specialized. In this paper, we introduce a method for *selective association in context memory* that addresses this shortcoming. Our approach leverages a targeted “association block” drawn from the extensive content in the model’s context window, focusing attention on the most relevant sub-scenes. By selectively filtering and organizing the visual stream, we enable more precise alignment of textual and visual cues for task-specific understanding. This mirrors the human cognitive strategy of associating smaller, relevant incidents to recall and interpret them effectively. We demonstrate the utility of our approach with examples from the medical domain—specifically, in analyzing videos of neurological movement disorders, where identifying subtle clinical cues requires robust context awareness.
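The abstract describes the selection step only at a high level; as a rough, hedged illustration of the filtering idea, the sketch below assumes that each video sub-scene and the task-specific query have already been embedded by some vision-language encoder (not specified in the paper), and selects the top-k sub-scenes by cosine similarity to form the “association block.” All names here (`build_association_block`, `top_k`, the toy data) are hypothetical and not taken from the submission.

```python
import numpy as np

def build_association_block(segments, segment_embeddings, query_embedding, top_k=4):
    """Hypothetical sketch: pick the sub-scenes most relevant to the query
    and return them in temporal order, forming a compact "association block"
    to place in the model's context instead of the full video."""
    # Cosine similarity between each sub-scene embedding and the query embedding.
    seg = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = seg @ q

    # Keep only the top-k most relevant sub-scenes, restoring temporal order
    # so the selected clips still read as a coherent (if abridged) sequence.
    selected = np.sort(np.argsort(scores)[::-1][:top_k])
    return [segments[i] for i in selected]

# Toy usage with random embeddings standing in for a real encoder's output.
rng = np.random.default_rng(0)
segments = [f"sub_scene_{i}" for i in range(10)]
segment_embeddings = rng.normal(size=(10, 64))
query_embedding = rng.normal(size=64)

block = build_association_block(segments, segment_embeddings, query_embedding, top_k=3)
print(block)  # e.g. ['sub_scene_2', 'sub_scene_5', 'sub_scene_8']
```

This is only one plausible reading of “selectively filtering and organizing the visual stream”; the paper itself may realize the association block differently (e.g., with learned attention rather than a fixed top-k cut).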
Submission Number: 32