VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

Atif Belal; Heitor Rapela Medeiros; Marco Pedersoli; Eric Granger

17 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Test Time Adaptation, Object detection, Vision language model, vision language object detectors

TL;DR: We introduce VLOD-TTA, to our knowledge the first test-time adaptation framework for vision-language object detectors (VLODs).

Abstract: Vision–language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts -- including stylized domains, driving scenes, low-light conditions, and common corruptions -- shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines.
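
The abstract does not give the exact form of the IoU-weighted entropy objective, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes each proposal's prediction entropy is weighted by its summed IoU with the other proposals, so that overlapping clusters dominate the loss and isolated boxes contribute little. The function names (`overlap_weights`, `iou_weighted_entropy`) and the weighting scheme are hypothetical.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def overlap_weights(boxes):
    """Assumed weighting: summed IoU with all other boxes, normalized to sum to 1."""
    n = len(boxes)
    raw = [sum(iou(boxes[i], boxes[j]) for j in range(n) if j != i) for i in range(n)]
    total = sum(raw)
    if total == 0:
        # No overlaps at all: fall back to uniform weights.
        return [1.0 / n] * n
    return [r / total for r in raw]

def iou_weighted_entropy(boxes, probs):
    """Weighted average of per-proposal entropies; isolated boxes get low weight."""
    weights = overlap_weights(boxes)
    return sum(w * entropy(p) for w, p in zip(weights, probs))

# Two overlapping proposals and one isolated proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
probs = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]
loss = iou_weighted_entropy(boxes, probs)
```

In this toy example the isolated third box receives zero weight, so the loss is driven entirely by the two overlapping proposals. In an actual test-time adaptation loop, a differentiable version of this quantity would be minimized with respect to a small set of adapted parameters; which parameters those are is not stated in the abstract.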

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 9781
