Keywords: Long-range Video Reasoning, Bengali Language Processing, Visual Question Answering, Large Language Models, Close-ended Answer Selection, Retrieval-Augmented Generation, Multimodal Captioning
TL;DR: CeAS and RAG for Bengali Long-Range Video Reasoning
Abstract: Long-range video question answering (VQA) remains a challenging task, especially
in low-resource languages like Bengali, due to limited linguistic tools and
the need for multi-step temporal reasoning. To address these challenges, we propose
a training-free framework for Bengali Long-range Video Reasoning (BLrVR).
Our approach adapts the EgoSchema benchmark to Bengali through high-quality
translation and contextual validation. We introduce a novel prompting strategy,
CeAS (Close-ended Answer Selection), which integrates structured roles, task
cues, and strict constraints to guide LLM reasoning. Additionally, we explore a
Retrieval-Augmented Generation (RAG) variant that fuses relevant caption context
with external evidence for enriched inference. Empirical results show that
CeAS achieves state-of-the-art performance, surpassing RAG in precision, recall,
and runtime efficiency while matching it in accuracy and F1-score. We further
benchmark different captioners, LLMs, retrievers, and prompting schemes, providing
a comprehensive evaluation of components crucial to BLrVR success. Our
findings demonstrate that structured prompting can outperform retrieval-heavy
methods in both effectiveness and efficiency for low-resource multimodal reasoning.
The code is publicly released at: https://github.com/anaxy-code/Bengali-Long-Range-Video-Reasoning
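As a rough illustration of the closed-ended answer-selection idea, a CeAS-style prompt combines a role, a task cue, and strict output constraints, and the model's reply is mapped back to one of the fixed options. The field names, option lettering, and parsing rule below are illustrative assumptions, not the paper's actual template:

```python
import re

def build_ceas_prompt(captions, question, options):
    """Assemble a structured prompt: role, task cue, strict constraints.

    All wording here is a hypothetical stand-in for the paper's template.
    """
    role = "You are an expert video-reasoning assistant for Bengali VQA."
    task_cue = (
        "Read the frame captions, then answer the multiple-choice "
        "question by selecting exactly one option."
    )
    constraints = (
        "Constraints: reply with only the option letter (A-E); "
        "do not explain; do not use information absent from the captions."
    )
    caption_block = "\n".join(f"- {c}" for c in captions)
    option_block = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCDE", options)
    )
    return (
        f"{role}\n\n{task_cue}\n\n"
        f"Captions:\n{caption_block}\n\n"
        f"Question: {question}\nOptions:\n{option_block}\n\n{constraints}"
    )

def parse_choice(llm_reply, n_options=5):
    """Map the model's free-text reply to an option index (closed-ended)."""
    match = re.search(r"\b([A-E])\b", llm_reply.upper())
    if match:
        idx = "ABCDE".index(match.group(1))
        if idx < n_options:
            return idx
    return None  # abstain when no valid option letter is found
```

Constraining the model to emit only an option letter is what makes the selection "close-ended": scoring reduces to exact index matching, with abstention when the reply contains no valid letter.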
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8409