Closed-loop Scaling Up for Visual Object Tracking

ICLR 2025 Conference Submission 1572 Authors

18 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Scaling law, Downstream vision tasks, Visual object tracking
TL;DR: We explore the scaling law in downstream vision tasks, taking visual object tracking as a case study.
Abstract: Thanks to the scaling law, current neural networks have achieved remarkable performance improvements. While much of the existing research has concentrated on upstream pretraining, the application of the scaling law to downstream vision tasks remains underexplored. Understanding the scaling law in downstream tasks can aid the design of more effective models and training strategies. In this work, we therefore investigate how the scaling law applies to downstream vision tasks. Firstly, we explore the impact of three key scaling factors: training data volume, model size, and input resolution, and empirically verify that increasing each of them improves performance. Secondly, to address the optimization challenges of naive training and its lack of iterative refinement, we introduce DT-Training, which leverages small-teacher transfer and dual-branch alignment to further exploit model potential. Thirdly, building on DT-Training, we propose a closed-loop scaling strategy that scales the model up incrementally, step by step. Finally, our scaled model outperforms existing counterparts across diverse test benchmarks, and extensive experiments also reveal its robust transfer ability. Moreover, we validate the generalizability of the scaling law and of our proposed DT-Training on other downstream vision tasks, reinforcing the broader applicability of our approach. We hope that our findings deepen the understanding of the scaling law in downstream tasks and foster future progress on them.
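To make the closed-loop scaling idea concrete, the sketch below is a minimal, hypothetical Python/PyTorch illustration, not the authors' implementation: each trained model serves as a "small teacher" for the next, larger one, and the DT-Training step combines a supervised branch with a teacher-alignment branch. All names (build_tracker, dt_training_step, closed_loop_scaling), the toy MLP "tracker", and the specific loss choices are assumptions made purely for illustration.

```python
# Hypothetical sketch of closed-loop scaling with a DT-Training-style objective.
# The real method's architecture, losses, and schedule are not specified here.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_tracker(width: int) -> nn.Module:
    """Placeholder 'tracker': a tiny MLP standing in for a real tracking model."""
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(3 * 64 * 64, width),
                         nn.ReLU(),
                         nn.Linear(width, 4))  # predicts a box (x, y, w, h)

def dt_training_step(student, teacher, frames, boxes, optimizer, alpha=0.5):
    """One training step with two branches: ground-truth supervision and
    alignment with a smaller, already-trained teacher (if available)."""
    pred = student(frames)
    loss = F.l1_loss(pred, boxes)                      # branch 1: supervised loss
    if teacher is not None:
        with torch.no_grad():
            teacher_pred = teacher(frames)
        loss = loss + alpha * F.mse_loss(pred, teacher_pred)  # branch 2: teacher alignment
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def closed_loop_scaling(widths, loader, epochs=1):
    """Scale the model step by step; each trained model becomes the teacher
    for the next, larger one, closing the loop."""
    teacher = None
    for width in widths:                               # e.g. [128, 256, 512]
        student = build_tracker(width)
        optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
        for _ in range(epochs):
            for frames, boxes in loader:
                dt_training_step(student, teacher, frames, boxes, optimizer)
        teacher = student.eval()                       # current model teaches the next
    return teacher

if __name__ == "__main__":
    # Toy data: random 64x64 frames with random target boxes, just to exercise the loop.
    frames, boxes = torch.rand(32, 3, 64, 64), torch.rand(32, 4)
    closed_loop_scaling([128, 256, 512], loader=[(frames, boxes)])
```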
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1572