Abstract: Despite their strong performance in integrating textual and visual information, vision-language models (VLMs) still face challenges in fine-grained visual perception tasks that demand detailed pixel-level analysis and reasoning. We introduce VipAct, an agent framework that enhances VLMs through multi-agent collaboration and vision expert models for precise visual understanding and reasoning. VipAct features an orchestrator agent for task analysis, planning, and coordination; specialized agents for subtasks such as image captioning; and vision expert models for high-precision perception. This approach improves VLMs' fine-grained visual perception by integrating planning, reasoning, and tool use. We evaluate VipAct on diverse visual perception benchmarks, showing significant improvements over state-of-the-art baselines across multiple VLMs. Ablation studies highlight the importance of multi-agent collaboration for detailed System-2 reasoning and the critical role of image input in task planning. Error analysis further reveals inherent limitations of VLMs, offering insights for future improvements.
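To make the orchestrator/specialist/tool pattern described in the abstract concrete, the following is a minimal sketch of how an orchestrator agent might plan a perception task, delegate to specialized agents or vision expert tools, and aggregate their outputs before answering. All names (`Tool`, `Orchestrator`, `solve`, the prompt wording) are hypothetical illustrations under assumed interfaces, not the authors' actual implementation.

```python
# Minimal sketch (assumption): an orchestrator that plans which specialized
# agents / vision expert tools to call, runs them, and reasons over the evidence.
# All class and function names here are hypothetical, not from the paper's code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    """A specialized agent or vision expert model exposed as a callable tool."""
    name: str
    description: str
    run: Callable[[str, bytes], str]  # (instruction, image bytes) -> text result


@dataclass
class Orchestrator:
    """Analyzes the task, plans tool calls, and aggregates their evidence."""
    vlm: Callable[[str, bytes], str]  # backbone VLM: (prompt, image bytes) -> text
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def solve(self, question: str, image: bytes) -> str:
        # 1. Task analysis and planning: ask the VLM which tools to invoke.
        tool_menu = "\n".join(f"- {t.name}: {t.description}" for t in self.tools.values())
        plan = self.vlm(
            f"Question: {question}\nAvailable tools:\n{tool_menu}\n"
            "List the tool names to call, one per line.",
            image,
        )
        # 2. Delegation: run each planned tool and collect its output as evidence.
        evidence: List[str] = []
        for name in (line.strip("- ").strip() for line in plan.splitlines()):
            if name in self.tools:
                evidence.append(f"[{name}] {self.tools[name].run(question, image)}")
        # 3. Final reasoning: the VLM answers using the aggregated evidence.
        return self.vlm(
            f"Question: {question}\nEvidence:\n" + "\n".join(evidence) + "\nAnswer concisely.",
            image,
        )
```

A captioning agent or an edge/keypoint detector would each be registered as a `Tool`; the design keeps the backbone VLM responsible for planning and final reasoning while high-precision perception is offloaded to the registered experts.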
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, vision question answering, cross-modal application, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1622