Abstract: Highlights
• A new contrastive one-stage fusion framework for efficient vision-language tracking.
• Contrastive alignment regularizes cross-modal feature learning in a unified space.
• Proposes VL-SOT500, the first large-scale multi-modal small object tracking dataset.
• Achieves superior performance on five benchmarks and the VL-SOT500 dataset.
• Provides valuable insights for future vision-language tracking research.
External IDs: dblp:journals/inffus/ZhangLGSWZGW26