Abstract: This paper proposes LIDet, a language-guided iterative object detection framework designed to address challenges in open-vocabulary object detection, such as missed detections of small objects and rare categories, as well as false positives. Without retraining the detection model, the method constructs a four-stage closed-loop process: "image preprocessing → multimodal perception → object detection → language reasoning." Leveraging the semantic reasoning capabilities of large language models (LLMs), LIDet generates potential missing object categories and their spatial relationships from the detected objects and a scene description, guiding the visual detector to dynamically crop and re-examine image regions. Experiments show that LIDet improves Acc@IoU=0.25 by an average of 3\% over MQADet on the RefCOCO series of datasets and outperforms the original detection model. Although computationally intensive, LIDet establishes a language-vision interaction mechanism at the semantic level, offering a novel approach to multimodal reasoning and open-vocabulary object detection.
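The closed loop described in the abstract can be summarized as: detect, ask the LLM which objects are likely missing and roughly where, then re-run the detector on cropped regions. The sketch below is a minimal Python illustration of that control flow only; every function name (`run_detector`, `llm_propose_missing`), the region-hint format, and the stopping rule are hypothetical placeholders inferred from the abstract, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the LIDet detect -> reason -> re-detect loop.
# All functions here are stubs standing in for components the paper
# describes at a high level; none of this is the authors' real code.

from typing import List, Tuple

# (label, x1, y1, x2, y2) in full-image coordinates
Box = Tuple[str, float, float, float, float]
# (category, (x1, y1, x2, y2)) region hint proposed by the LLM
Proposal = Tuple[str, Tuple[float, float, float, float]]

def run_detector(image, vocab: List[str]) -> List[Box]:
    """Stub for the frozen open-vocabulary detector (stage 3)."""
    return []

def llm_propose_missing(scene_desc: str, dets: List[Box]) -> List[Proposal]:
    """Stub for stage 4: the LLM reads the scene description and current
    detections, then suggests likely missed categories plus region hints."""
    return []

def lidet_loop(image, scene_desc: str, vocab: List[str],
               max_rounds: int = 3) -> List[Box]:
    dets = run_detector(image, vocab)          # initial full-image pass
    for _ in range(max_rounds):                # iterate the closed loop
        proposals = llm_propose_missing(scene_desc, dets)
        if not proposals:                      # LLM finds nothing missing: stop
            break
        for category, region in proposals:
            crop = image                       # placeholder: would crop `region`
            dets += run_detector(crop, [category])  # re-examine the hinted area
    return dets
```

Because the detector is frozen and only re-invoked on crops, the extra cost grows with the number of LLM proposals per round, which is consistent with the abstract's note that the method is computationally intensive.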
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal application, cross-modal information extraction
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2853