Visual Grounding Meets Language: CeAS and RAG for Bengali Long-Range Video Reasoning

17 Sept 2025 (modified: 25 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Long-range Video Reasoning, Bengali Language Processing, Visual Question Answering, Large Language Models, Close-ended Answer Selection, Retrieval-Augmented Generation, Multimodal Captioning
TL;DR: CeAS and RAG for Bengali Long-Range Video Reasoning
Abstract: Long-range video question answering (VQA) remains a challenging task, especially in low-resource languages like Bengali, due to limited linguistic tools and the need for multi-step temporal reasoning. To address these challenges, we propose a training-free framework for Bengali Long-range Video Reasoning (BLrVR). Our approach adapts the EgoSchema benchmark to Bengali through high-quality translation and contextual validation. We introduce a novel prompting strategy, CeAS (Close-ended Answer Selection), which integrates structured roles, task cues, and strict constraints to guide LLM reasoning. Additionally, we explore a Retrieval-Augmented Generation (RAG) variant that fuses relevant caption context with external evidence for enriched inference. Empirical results show that CeAS achieves state-of-the-art performance, surpassing RAG in precision, recall, and runtime efficiency while matching it in accuracy and F1-score. We further benchmark different captioners, LLMs, retrievers, and prompting schemes, providing a comprehensive evaluation of the components crucial to BLrVR success. Our findings demonstrate that structured prompting can outperform retrieval-heavy methods in both effectiveness and efficiency for low-resource multimodal reasoning. The code is publicly released at: https://github.com/anaxy-code/Bengali-Long-Range-Video-Reasoning
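The abstract describes CeAS as a prompt combining a structured role, task cues, and strict output constraints for close-ended answer selection. A minimal, purely illustrative sketch of what such a prompt builder might look like is below; the function name and template wording are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a CeAS-style (Close-ended Answer Selection) prompt,
# based only on the abstract's description: a structured role, task cues, and
# a strict constraint steering an LLM to pick exactly one candidate answer.
def build_ceas_prompt(captions, question, options):
    """Assemble a close-ended answer-selection prompt (illustrative only)."""
    option_lines = "\n".join(f"({i}) {opt}" for i, opt in enumerate(options))
    caption_block = "\n".join(captions)
    return (
        "Role: You are an expert assistant for Bengali long-range video "
        "reasoning.\n"                                       # structured role
        "Task: Using only the captions below, answer the multiple-choice "
        "question about the video.\n"                        # task cue
        f"Captions:\n{caption_block}\n"
        f"Question: {question}\n"
        f"Options:\n{option_lines}\n"
        "Constraint: Reply with exactly one option index and nothing else."
    )                                                        # strict constraint
```

The returned string would be sent to the LLM, whose single-index reply can then be parsed directly, avoiding free-form generation.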
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8409