COST: Contrastive one-stage transformer for vision-language small object tracking

Published: 2026, Last Modified: 06 Nov 2025, Venue: Inf. Fusion 2026, License: CC BY-SA 4.0
Abstract: Highlights
• A new contrastive one-stage fusion framework for efficient vision-language tracking.
• Contrastive alignment regularizes cross-modal feature learning in a unified space.
• Proposes VL-SOT500, the first large-scale multi-modal small object tracking dataset.
• Achieves superior performance on five benchmarks and the VL-SOT500 dataset.
• Provides valuable insights for future vision-language tracking research.
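To make the "contrastive alignment" highlight more concrete, the sketch below shows one common way such a regularizer is written: a symmetric InfoNCE-style loss that pulls each visual embedding toward its paired text embedding and pushes it away from the other pairs in the batch. This is not taken from the paper. The function names, the temperature value of 0.07, and the use of cosine similarity are all assumptions chosen for illustration, and the actual COST objective may differ.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_alignment_loss(visual, text, temperature=0.07):
    # Hypothetical helper: visual[i] and text[i] are assumed to be a matched pair.
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    # Cosine similarities scaled by the temperature (an assumed hyperparameter).
    logits_v2t = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
                  for vi in v]
    # Text-to-visual logits are the transpose of the visual-to-text logits.
    logits_t2v = [list(col) for col in zip(*logits_v2t)]
    return 0.5 * (cross_entropy_diagonal(logits_v2t)
                  + cross_entropy_diagonal(logits_t2v))

# Toy batch: two visual embeddings with correctly paired vs. swapped text embeddings.
aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[2.0, 0.0], [0.0, 3.0]])
swapped = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 3.0], [2.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped pairing gives a much larger one, which is the behavior an alignment regularizer of this kind relies on.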