Visual Agents as Fast and Slow Thinkers

ICLR 2025 Conference Submission 1406 Authors

18 Sept 2024 (modified: 21 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Multimodal Large Language Model, System 2 Thinking, Language Agent
Abstract: Achieving human-level intelligence requires refining the cognitive distinction between \textit{System 1} and \textit{System 2} thinking. While contemporary AI, driven by large language models, exhibits human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident responses. To address this challenge, we introduce \textbf{\textsc{FaST}}, which incorporates the \textbf{Fa}st and \textbf{S}low \textbf{T}hinking mechanism into visual agents. \textsc{FaST} employs a switch adapter to dynamically select between \textit{System 1/2} modes, tailoring the problem-solving approach to tasks of varying complexity. It handles uncertain and unseen objects by adjusting model confidence and integrating new contextual data. With this novel design, we advocate a \textit{flexible system}, \textit{hierarchical reasoning} capabilities, and a \textit{transparent decision-making} pipeline, all of which contribute to its ability to emulate human-like cognitive processes in visual intelligence. Empirical results demonstrate that \textsc{FaST} outperforms various well-known baselines, achieving 80.8\% accuracy on $VQA^{v2}$ for visual question answering and a 48.7\% $GIoU$ score on ReasonSeg for reasoning segmentation. Extensive testing validates the efficacy and robustness of \textsc{FaST}'s core components, showcasing its potential to advance the development of cognitive visual agents in AI systems.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1406