A Multimodal Chain of Tools for Described Object Detection

Published: 10 Oct 2024, Last Modified: 25 Dec 2024
Venue: NeurIPS'24 Compositional Learning Workshop Poster
License: CC BY 4.0
Keywords: Described Object Detection
Abstract: Described object detection (DOD) is a promising direction for fine-grained and human-interactive visual recognition, where the goal is to detect target objects based on given language descriptions. Despite significant advancements in language-based object detection, current models still struggle with complex descriptions due to limited compositional understanding. To address this issue, we propose a novel multimodal chain-of-tools (MCoTs) framework that integrates specialized tools to handle the two core functionalities of the DOD task: localization and compositional reasoning. Specifically, we decompose the complex DOD task into a series of subtasks, each handled by a specialized tool, including a detector and a multimodal large language model (MLLM). This simple yet effective MCoTs framework demonstrates significant performance improvements on the challenging D3 benchmark without additional training overhead.
Submission Number: 48
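
Illustrative sketch (not the authors' implementation): the abstract does not spell out the individual subtasks, so the following Python sketch shows only one possible way a detector and an MLLM could be chained. It assumes a two-step decomposition in which a detector localizes candidates for a coarse category and an MLLM-style verifier checks each candidate against the full description. The names `chain_of_tools`, `Detection`, `toy_localize`, `toy_verify`, and the `min_score` threshold are hypothetical, and the two toy tools are plain stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    box: Box
    score: float


# Hypothetical tool interfaces (assumptions, not taken from the paper):
# a localizer proposes boxes for a coarse category, and a verifier decides
# whether the region inside a box satisfies the full description.
Localizer = Callable[[object, str], List[Detection]]
Verifier = Callable[[object, Box, str], bool]


def chain_of_tools(image, description: str, category: str,
                   localize: Localizer, verify: Verifier,
                   min_score: float = 0.3) -> List[Detection]:
    """Run localization, then compositional verification, with no training."""
    # Subtask 1 (localization): keep sufficiently confident candidates.
    candidates = [d for d in localize(image, category) if d.score >= min_score]
    # Subtask 2 (compositional reasoning): keep candidates the verifier accepts.
    return [d for d in candidates if verify(image, d.box, description)]


# Toy stand-ins for a detector and an MLLM, so the sketch runs end to end.
# The "image" is a list of annotated objects instead of pixels.
toy_image = [
    {"box": (0, 0, 10, 10), "category": "dog", "attrs": {"black"}, "score": 0.9},
    {"box": (20, 0, 30, 10), "category": "dog", "attrs": {"white"}, "score": 0.8},
    {"box": (40, 0, 50, 10), "category": "cat", "attrs": {"black"}, "score": 0.95},
]


def toy_localize(image, category: str) -> List[Detection]:
    return [Detection(o["box"], o["score"]) for o in image
            if o["category"] == category]


def toy_verify(image, box: Box, description: str) -> bool:
    obj = next(o for o in image if o["box"] == box)
    # Accept only if every word of the description is supported by the object.
    return set(description.split()) <= obj["attrs"] | {obj["category"]}


matches = chain_of_tools(toy_image, "black dog", "dog", toy_localize, toy_verify)
no_matches = chain_of_tools(toy_image, "brown dog", "dog", toy_localize, toy_verify)
```

Because both tools are passed in as callables, either one can be replaced by a real model without changing the chaining logic. A description that matches nothing in the image yields an empty list.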