Autogenic Language Embedding for Coherent Point Tracking

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Point tracking is a challenging task in computer vision, aiming to establish point-wise correspondence across long video sequences. Recent advancements have primarily focused on temporal modeling techniques to improve local feature similarity, often overlooking the valuable semantic consistency inherent in tracked points. In this paper, we introduce a novel approach leveraging language embeddings to enhance the coherence of frame-wise visual features related to the same object. We recognize that videos typically involve a limited number of objects with specific semantics, allowing us to automatically learn language embeddings. Our proposed method, termed autogenic language embedding for visual feature enhancement, strengthens point correspondence in long-term sequences. Unlike existing visual-language schemes, our approach learns text embeddings from visual features through a dedicated mapping network, enabling seamless adaptation to various tracking tasks without explicit text annotations. Additionally, we introduce a consistency decoder that efficiently integrates text tokens into visual features with minimal computational overhead. Through enhanced visual consistency, our approach significantly improves point tracking trajectories in lengthy videos with substantial appearance variations. Extensive experiments on widely-used point tracking benchmarks demonstrate the superior performance of our method, showcasing notable enhancements compared to trackers relying solely on visual cues.
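The abstract describes two components: a mapping network that derives language-like tokens directly from visual features (no text annotations), and a consistency decoder that folds those tokens back into frame-wise features. The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of that idea under assumed shapes: an MLP (`mapping_network`) pools visual features into `num_tokens` pseudo-text tokens, and a single cross-attention step with a residual connection (`consistency_decoder`) fuses them back. All layer sizes and function names are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mapping_network(visual_feats, W1, W2, num_tokens, dim):
    """Hypothetical: derive pseudo language tokens from pooled visual features."""
    pooled = visual_feats.mean(axis=0)             # (dim,) global pooling
    h = np.maximum(W1 @ pooled, 0.0)               # ReLU MLP hidden layer
    return (W2 @ h).reshape(num_tokens, dim)       # (num_tokens, dim) text tokens

def consistency_decoder(visual_feats, text_tokens):
    """Hypothetical: visual features attend to text tokens, residual fusion."""
    d = visual_feats.shape[1]
    attn = softmax(visual_feats @ text_tokens.T / np.sqrt(d), axis=-1)
    return visual_feats + attn @ text_tokens       # lightweight cross-attention

# Toy shapes: 16 point features of dimension 8, mapped to 4 pseudo-text tokens.
dim, num_tokens, n_points = 8, 4, 16
visual = rng.standard_normal((n_points, dim))
W1 = rng.standard_normal((32, dim))
W2 = rng.standard_normal((num_tokens * dim, 32))

tokens = mapping_network(visual, W1, W2, num_tokens, dim)
fused = consistency_decoder(visual, tokens)
print(tokens.shape, fused.shape)  # (4, 8) (16, 8)
```

Because the tokens are produced from the video itself, the same token set conditions every frame, which is one plausible way the method could impose semantic consistency across a long sequence without explicit text supervision.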
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work proposes an autogenic language-assisted visual consistency scheme for the point tracking task. We thoroughly analyze the text-embedded visual features in vision-language models, and design an autogenic language-assisted visual feature enhancement to reinforce point correspondence in long-term sequences. This work integrates two modalities, language and vision, and explores how language embeddings can enhance the consistency of visual features across long video sequences. It therefore contributes directly to advancing multimedia processing.
Submission Number: 2573