Revisiting Incremental Object Detection with Pre-Trained Vision-Language Models

18 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Object Detection, Incremental Learning, Vision Language Model
Abstract: Pre-trained Vision-Language Models (VLMs) have recently been applied to Incremental Object Detection (IOD), achieving notable progress. However, existing research often oversimplifies real-world scenarios by assuming that incremental tasks come from a single general domain. To better investigate VLMs under IOD, it is necessary to explore more generalized scenarios that encompass both novel categories and novel domains. To this end, we propose Cross-Domain Incremental Object Detection (CDIOD), a new benchmark that assesses the ability to continuously adapt to diverse object detection tasks across domains. CDIOD reveals that existing methods struggle to balance adaptivity and stability under substantial domain shifts. To tackle this challenge, we propose $\textbf{D$^3$}$, a novel framework that combines $\textbf{D}$ynamic grouping to promote knowledge sharing and prevent task collisions; $\textbf{D}$ynamic adapter assignment to effectively adapt to new tasks while controlling model scale; and a $\textbf{D}$ynamic training pipeline to ensure a proper stability-adaptivity balance. D$^3$ enables VLMs to effectively handle task streams with varying distribution shifts. Extensive experiments demonstrate that D$^3$ achieves state-of-the-art results across three benchmarks, highlighting its versatility and robustness in diverse incremental learning scenarios.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10420