Keywords: Visual Reasoning, Active Reasoning, Uncertainty, LLM
Abstract: Real-world reasoning rarely reduces to static question answering: agents must actively gather information from tools and sensors whose outputs are often noisy or incorrect. However, most existing active reasoning benchmarks either focus on environments where feedback is largely reliable or inject noise without providing an explicit, calibrated uncertainty signal about tool outputs, making it difficult to analyze how LLMs should reason with uncertain evidence. We introduce VAR, a benchmark for active reasoning under noisy visual feedback that is explicitly designed to evaluate text-only LLM reasoners: a fixed, off-the-shelf VLM is treated as a stochastic visual sensor, and the LLM must solve VQA problems solely by querying this sensor. For each sensor query, we draw multiple samples and expose a coarse uncertainty signal via self-consistency, enabling the reasoner to probe from different angles and decide what to ask next and when to stop. Our construction is automatic and scalable: starting from diverse VQA sources and two modern VLMs, we select instances on which the sensor is inconsistent but that remain human-solvable. VAR thus provides a controlled playground to study how different LLMs exploit uncertainty signals for robust reasoning.
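The self-consistency signal described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function name and the agreement-fraction formulation are assumptions, showing one common way to turn repeated samples from a stochastic sensor into a coarse uncertainty estimate:

```python
from collections import Counter

def self_consistency(samples):
    """Majority answer and agreement rate over repeated sensor samples.

    A high agreement fraction suggests the sensor is consistent on this
    query; a low fraction flags an answer the reasoner may want to
    re-probe from a different angle before committing.
    """
    counts = Counter(samples)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(samples)

# Hypothetical example: a stochastic VLM sensor queried 5 times
answer, agreement = self_consistency(["red", "red", "blue", "red", "red"])
# majority answer "red" with agreement 4/5 = 0.8
```

Under this sketch, the reasoner would receive both the majority answer and the agreement score, and could use the latter to decide what to ask next and when to stop.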
Paper Type: Short
Research Area: AI/LLM Agents
Research Area Keywords: AI/LLM Agents, Dialogue and Interactive Systems, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 9926