More Views, More Problems? A Critical Analysis of Multi-View Aggregation for Agent Perception

Published: 31 Mar 2026 · Last Modified: 31 Mar 2026 · ARMS 2026 Oral · CC BY 4.0
AAMAS Extended Abstract ID: 1517
Keywords: Autonomous Robots, Active Perception, Vision-Language Models, Multi-view Synthesis
Abstract: For an autonomous agent to interact effectively with its environment, it must construct accurate and robust representations from visual data. Multi-view perception is generally expected to enhance understanding, but how best to integrate information across views remains an open question. In this work, we examine an LLM-based synthesis strategy in which captions generated from multiple viewpoints by a vision-language model (VLM) are aggregated by a large language model (LLM) into a single, unified description. We evaluate this approach against three conditions: a canonical single-view baseline, a naive average, and an oracle that selects the most informative viewpoint. Experiments are conducted on two datasets: a collection of in-domain, real-world objects and a domain-shifted set of 3D-printed objects. Our results show that synthesis can successfully combine complementary information when per-view captions are reliable, yielding descriptions superior to those from a static view. However, when domain shift reduces caption quality, the same strategy degrades substantially, often performing worse than the single-view baseline. These findings highlight the brittleness of current multi-view aggregation methods and underscore the need for more robust information-fusion mechanisms for reliable perception in autonomous agents, particularly in robotic settings where accurate scene understanding directly affects control, manipulation, and safety.
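To make the four evaluation conditions concrete, the following is a minimal sketch of the aggregation pipeline the abstract describes. It assumes stand-in callables: `caption_view`, `synthesize`, and `score` are hypothetical placeholders for the VLM captioner, the LLM-based synthesis step, and an informativeness measure; they are not the authors' actual implementation. Reading the "naive average" condition as a simple concatenation of per-view captions is likewise an assumption made only for illustration.

```python
# Sketch of the four conditions: single view, naive average, LLM synthesis, oracle.
# caption_view / synthesize / score are hypothetical stand-ins, not the paper's code.
from typing import Callable, Sequence


def aggregate_captions(
    views: Sequence[str],
    caption_view: Callable[[str], str],
    synthesize: Callable[[Sequence[str]], str],
    score: Callable[[str], float],
) -> dict[str, str]:
    """Produce one description per evaluation condition from a set of viewpoints."""
    captions = [caption_view(v) for v in views]

    return {
        # Canonical single-view baseline: caption only the canonical (first) view.
        "single_view": captions[0],
        # Naive average (one plausible reading): pool captions without reconciliation.
        "naive_average": " ".join(captions),
        # LLM-based synthesis: fuse per-view captions into a unified description.
        "synthesis": synthesize(captions),
        # Oracle: select the single most informative viewpoint by a quality score.
        "oracle": max(captions, key=score),
    }


# Toy usage with trivial stand-ins; a real run would call a VLM and an LLM.
if __name__ == "__main__":
    views = ["front.png", "side.png", "top.png"]
    out = aggregate_captions(
        views,
        caption_view=lambda v: f"caption of {v}",
        synthesize=lambda caps: " | ".join(caps),
        score=len,  # crude proxy for informativeness in this toy example
    )
    for condition, description in out.items():
        print(condition, "->", description)
```

The abstract's central finding maps directly onto this structure: when `caption_view` is reliable, the `synthesis` condition can exceed `single_view`, but errors in the per-view captions propagate into the fused description under domain shift.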
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 2