Abstract: Despite their strong performance in integrating textual and visual information, vision-language models (VLMs) still face challenges in fine-grained visual perception tasks that demand detailed pixel-level analysis and reasoning. We introduce VipAct, an agent framework that enhances VLMs through multi-agent collaboration and vision expert models for precise visual understanding and reasoning. VipAct features an orchestrator agent for task analysis, planning, and coordination; specialized agents for subtasks such as image captioning; and vision expert models for high-precision perception. This approach improves VLMs' fine-grained visual perception by integrating planning, reasoning, and tool use. We evaluate VipAct on diverse visual perception benchmarks, showing significant improvements over state-of-the-art baselines across multiple VLMs. Ablation studies highlight the importance of multi-agent collaboration for detailed System-2 reasoning and the critical role of image input in task planning. Error analysis further reveals inherent limitations of VLMs, offering insights for future improvements.
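To make the orchestrator/specialist/tool pattern described in the abstract concrete, the following is a minimal sketch of how an orchestrator agent might plan a perception task, delegate to specialized agents or vision expert tools, and aggregate their outputs before answering. All names (`Tool`, `Orchestrator`, `solve`, the prompt wording) are hypothetical illustrations under assumed interfaces, not the authors' actual implementation.

```python
# Minimal sketch (assumption): an orchestrator that plans which specialized
# agents / vision expert tools to call, runs them, and reasons over the evidence.
# All class and function names here are hypothetical, not from the paper's code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    """A specialized agent or vision expert model exposed as a callable tool."""
    name: str
    description: str
    run: Callable[[str, bytes], str]  # (instruction, image bytes) -> text result


@dataclass
class Orchestrator:
    """Analyzes the task, plans tool calls, and aggregates their evidence."""
    vlm: Callable[[str, bytes], str]  # backbone VLM: (prompt, image bytes) -> text
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def solve(self, question: str, image: bytes) -> str:
        # 1. Task analysis and planning: ask the VLM which tools to invoke.
        tool_menu = "\n".join(f"- {t.name}: {t.description}" for t in self.tools.values())
        plan = self.vlm(
            f"Question: {question}\nAvailable tools:\n{tool_menu}\n"
            "List the tool names to call, one per line.",
            image,
        )
        # 2. Delegation: run each planned tool and collect its output as evidence.
        evidence: List[str] = []
        for name in (line.strip("- ").strip() for line in plan.splitlines()):
            if name in self.tools:
                evidence.append(f"[{name}] {self.tools[name].run(question, image)}")
        # 3. Final reasoning: the VLM answers using the aggregated evidence.
        return self.vlm(
            f"Question: {question}\nEvidence:\n" + "\n".join(evidence) + "\nAnswer concisely.",
            image,
        )
```

A captioning agent or an edge/keypoint detector would each be registered as a `Tool`; the design keeps the backbone VLM responsible for planning and final reasoning while high-precision perception is offloaded to the registered experts.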
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, vision question answering, cross-modal application, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1622