Abstract: Highlights
• Introduce a comprehensive bag of textual descriptions for VOT tracking.
• Provide a comprehensive bag of textual descriptions for six VOT datasets.
• Propose TTFUM to update target text features over time.
• Fuse visual and textual features using attention-based correlation.
• Evaluate CLDTrack on six benchmarks against 38 SOTA trackers.