Towards Task-Consistent Open-Vocabulary Adaptation in Video Recognition

18 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Recognition, Vision-Language Models, Open-vocabulary Learning
TL;DR: We introduce TACO, a simple yet effective framework for open-vocabulary video adaptation, which aims to mitigate the potential negative effects induced by the inconsistency between fine-tuning and evaluation objectives.
Abstract: Transferring CLIP to open-vocabulary video recognition has shown impressive effectiveness. To fit the video domain, the model is fine-tuned on a video dataset and is expected to generalize well to data with unseen categories. However, this fine-tuning paradigm overlooks variations of the representation space beyond the training distribution, leading to sub-optimal adaptation. In this paper, we introduce TACO, a simple yet effective framework that mitigates the potential negative effects induced by the inconsistency between fine-tuning and evaluation objectives. We formulate a more concrete adaptation principle by delving into the deficiencies of existing paradigms. Specifically, we propose a task decoupling method that mitigates knowledge overfitting by incorporating a specialization projection. Moreover, we offer new insights into preserving generalization and introduce \emph{Relative Structure Distillation}, which maintains a consistent relative structure between in-distribution and out-of-distribution representation spaces through knowledge distillation. Our proposed TACO establishes state-of-the-art performance on diverse benchmarks under cross-dataset and base-to-novel settings. Code will be released.
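The abstract's core idea of Relative Structure Distillation can be illustrated with a minimal sketch: rather than matching features directly, match the *pairwise similarity structure* of the fine-tuned (student) features to that of the frozen CLIP (teacher) features. The function names and the mean-squared-error objective below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a relative-structure distillation loss: the student's
# pairwise cosine-similarity matrix is pulled toward the teacher's, preserving
# relative geometry instead of absolute feature values. (Illustrative only.)
import numpy as np

def pairwise_cosine(feats: np.ndarray) -> np.ndarray:
    """Pairwise cosine-similarity matrix for a batch of feature vectors."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

def relative_structure_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared difference between student and teacher similarity structures."""
    diff = pairwise_cosine(student) - pairwise_cosine(teacher)
    return float(np.mean(diff ** 2))
```

Because only relative similarities are constrained, the student remains free to specialize its absolute feature values on the fine-tuning data while its out-of-distribution geometry stays anchored to the pretrained teacher.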
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13748