Visual Grounding Meets Language: CeAS and RAG for Bengali Long-Range Video Reasoning

17 Sept 2025 (modified: 25 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Long-range Video Reasoning, Bengali Language Processing, Visual Question Answering, Large Language Models, Close-ended Answer Selection, Retrieval-Augmented Generation, Multimodal Captioning
TL;DR: CeAS and RAG for Bengali Long-Range Video Reasoning
Abstract: Long-range video question answering (VQA) remains a challenging task, especially in low-resource languages like Bengali, due to limited linguistic tools and the need for multi-step temporal reasoning. To address these challenges, we propose a training-free framework for Bengali Long-range Video Reasoning (BLrVR). Our approach adapts the EgoSchema benchmark to Bengali through high-quality translation and contextual validation. We introduce a novel prompting strategy, CeAS (Close-ended Answer Selection), which integrates structured roles, task cues, and strict constraints to guide LLM reasoning. Additionally, we explore a Retrieval-Augmented Generation (RAG) variant that fuses relevant caption context with external evidence for enriched inference. Empirical results show that CeAS achieves state-of-the-art performance, surpassing RAG in precision, recall, and runtime efficiency while matching it in accuracy and F1-score. We further benchmark different captioners, LLMs, retrievers, and prompting schemes, providing a comprehensive evaluation of the components crucial to BLrVR success. Our findings demonstrate that structured prompting can outperform retrieval-heavy methods in both effectiveness and efficiency for low-resource multimodal reasoning. The code is publicly released at: https://github.com/anaxy-code/Bengali-Long-Range-Video-Reasoning
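The abstract describes CeAS as a prompt combining a structured role, task cues, and strict output constraints for close-ended answer selection. A minimal, purely illustrative sketch of what such a prompt builder might look like is below; the function name and template wording are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a CeAS-style (Close-ended Answer Selection) prompt,
# based only on the abstract's description: a structured role, task cues, and
# a strict constraint steering an LLM to pick exactly one candidate answer.
def build_ceas_prompt(captions, question, options):
    """Assemble a close-ended answer-selection prompt (illustrative only)."""
    option_lines = "\n".join(f"({i}) {opt}" for i, opt in enumerate(options))
    caption_block = "\n".join(captions)
    return (
        "Role: You are an expert assistant for Bengali long-range video "
        "reasoning.\n"                                       # structured role
        "Task: Using only the captions below, answer the multiple-choice "
        "question about the video.\n"                        # task cue
        f"Captions:\n{caption_block}\n"
        f"Question: {question}\n"
        f"Options:\n{option_lines}\n"
        "Constraint: Reply with exactly one option index and nothing else."
    )                                                        # strict constraint
```

The returned string would be sent to the LLM, whose single-index reply can then be parsed directly, avoiding free-form generation.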
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8409