Abstract: This paper proposes LIDet, a language-guided iterative object detection framework designed to address challenges in open-vocabulary object detection, such as missed detections of small objects and rare categories, as well as false positives. Without retraining the detection model, the method constructs a four-stage closed-loop process: "image preprocessing → multimodal perception → object detection → language reasoning." Leveraging the semantic reasoning capabilities of large language models (LLMs), LIDet generates potential missing object categories and their spatial relationships from the detected objects and a scene description, guiding the visual detector to dynamically crop and re-examine image regions. Experiments show that LIDet improves Acc@IoU=0.25 by an average of 3\% over MQADet on the RefCOCO series of datasets and outperforms the original detection model. Although computationally intensive, LIDet establishes a language-vision interaction mechanism at the semantic level, offering a novel approach to multimodal reasoning and open-vocabulary object detection.
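The closed loop described in the abstract can be summarized as: detect, ask the LLM which objects are likely missing and roughly where, then re-run the detector on cropped regions. The sketch below is a minimal Python illustration of that control flow only; every function name (`run_detector`, `llm_propose_missing`), the region-hint format, and the stopping rule are hypothetical placeholders inferred from the abstract, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the LIDet detect -> reason -> re-detect loop.
# All functions here are stubs standing in for components the paper
# describes at a high level; none of this is the authors' real code.

from typing import List, Tuple

# (label, x1, y1, x2, y2) in full-image coordinates
Box = Tuple[str, float, float, float, float]
# (category, (x1, y1, x2, y2)) region hint proposed by the LLM
Proposal = Tuple[str, Tuple[float, float, float, float]]

def run_detector(image, vocab: List[str]) -> List[Box]:
    """Stub for the frozen open-vocabulary detector (stage 3)."""
    return []

def llm_propose_missing(scene_desc: str, dets: List[Box]) -> List[Proposal]:
    """Stub for stage 4: the LLM reads the scene description and current
    detections, then suggests likely missed categories plus region hints."""
    return []

def lidet_loop(image, scene_desc: str, vocab: List[str],
               max_rounds: int = 3) -> List[Box]:
    dets = run_detector(image, vocab)          # initial full-image pass
    for _ in range(max_rounds):                # iterate the closed loop
        proposals = llm_propose_missing(scene_desc, dets)
        if not proposals:                      # LLM finds nothing missing: stop
            break
        for category, region in proposals:
            crop = image                       # placeholder: would crop `region`
            dets += run_detector(crop, [category])  # re-examine the hinted area
    return dets
```

Because the detector is frozen and only re-invoked on crops, the extra cost grows with the number of LLM proposals per round, which is consistent with the abstract's note that the method is computationally intensive.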
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal application, cross-modal information extraction
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2853