Enhancing object detection by leveraging large language models for contextual knowledge

Published: 03 Dec 2024, Last Modified: 26 Jan 2026 · International Conference on Pattern Recognition · CC BY 4.0
Abstract: Deep learning-based object detection models have been adopted across numerous applications, yet their accuracy degrades significantly under challenging imaging conditions such as fog or occlusion. To address these limitations, we present a novel approach that exploits scene contextual knowledge distilled from Large Language Models (LLMs). This enables our model to infer and anticipate object presence in a scene from contextual knowledge, akin to human perception, rather than relying on direct visual cues alone. Our method combines the capabilities of object detection models with the contextual interpretation and predictive capacity of LLaMA, an advanced LLM. The framework operates exclusively on the labels and positional information produced by a detection algorithm, avoiding any reliance on pixel-level image data during both training and inference. We validate the approach through extensive experiments on the COCO-2017 dataset, including a modified version that simulates reduced-visibility conditions. The results show that the integrated model outperforms standalone YOLO models, particularly in adverse conditions, with notable gains in detection accuracy across object sizes.
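To make the label-and-position interface described in the abstract concrete, the sketch below serializes detector output into a text prompt for an LLM. The prompt wording, function name, and detection format are illustrative assumptions, not taken from the paper; it shows only how pixel-free context could be passed to a language model.

```python
# Hypothetical sketch: render detector output (class labels + bounding boxes)
# as a text prompt for an LLM, mirroring the abstract's pixel-free interface.
# Prompt phrasing and the (label, (x, y, w, h)) format are assumptions.

def detections_to_prompt(detections):
    """Serialize a list of (label, (x, y, w, h)) detections into an LLM prompt."""
    lines = [
        f"- {label} at (x={x}, y={y}, w={w}, h={h})"
        for label, (x, y, w, h) in detections
    ]
    return (
        "The following objects were detected in a scene:\n"
        + "\n".join(lines)
        + "\nGiven this context, which additional objects are likely present "
        "but occluded or obscured?"
    )

# Example: partial detections from a foggy street scene.
prompt = detections_to_prompt([
    ("car", (120, 200, 80, 60)),
    ("traffic light", (40, 30, 15, 40)),
])
print(prompt)
```

The LLM's answer could then be fused with the detector's raw predictions; neither the fusion rule nor the actual LLaMA call is specified in the abstract, so they are omitted here.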