BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We adapt CLIP for action recognition using Brownian distance covariance (BDC) computed over all visual and textual tokens. This addresses the limitations of prior methods that rely on cosine similarity and highlights the value of richer dependency measures such as BDC.
Abstract: Bridging contrastive language-image pre-training (CLIP) to video action recognition has attracted growing interest. Human actions are inherently rich in spatial and temporal contexts, involving dynamic interactions among people, objects, and the environment. Accurately recognizing actions requires effectively capturing these fine-grained elements and modeling their relationships with language. However, most existing methods rely on cosine similarity--practically equivalent to the Pearson correlation coefficient--between global tokens for video-language alignment. As a result, they have limited capacity to model complex dependencies and tend to overlook local tokens that encode critical spatio-temporal cues. To overcome these limitations, we propose BDC-CLIP, a novel framework that leverages Brownian Distance Covariance (BDC) to align visual and textual representations. Our method can capture complex relationships--both linear and nonlinear--between all visual and textual tokens, enabling fine-grained modeling in space, time, and language. BDC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully supervised action recognition settings, demonstrating its effectiveness and broad applicability.
Lay Summary: Image-language models like CLIP can identify novel objects from few—or even zero—examples. Video understanding poses greater challenges: actions evolve over time, and the most informative cues are often found in localized image patches and specific words. Yet most existing methods squash entire clips and whole sentences into two global vectors and compare them with cosine similarity, which captures only basic linear patterns. In doing so, they discard the fine-grained cues that are crucial for accurate video recognition. We introduce BDC-CLIP, a novel framework that retains all visual patches across frames and all words from captions, aligning them using Brownian Distance Covariance (BDC)—a statistical dependency measure that captures both linear and non-linear relationships. A lightweight temporal adapter further aggregates BDC signals across frames, enabling the model to track interactions among objects, people, and actions over time. This results in a richer, token-level alignment between visual content and language. BDC-CLIP outperforms prior work in zero-shot, few-shot, and fully supervised action recognition tasks. Its fine-grained alignment mechanism also benefits downstream tasks such as video retrieval, captioning, and safety filtering, providing a reliable bridge between dynamic visual content and natural language descriptions.
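To make the core statistic concrete, below is a minimal NumPy sketch of the classical Székely–Rizzo Brownian distance covariance that BDC-CLIP builds on: pairwise distance matrices are double-centered and their element-wise product is averaged, yielding a measure that is zero (in the population limit) only under independence and therefore picks up nonlinear dependence that cosine similarity or Pearson correlation misses. This illustrates the statistic only; it is not the paper's token-alignment or temporal-adapter implementation, and the function names are ours.

```python
import numpy as np

def double_center(d):
    """Double-center a pairwise distance matrix (Szekely & Rizzo)."""
    row_mean = d.mean(axis=1, keepdims=True)
    col_mean = d.mean(axis=0, keepdims=True)
    return d - row_mean - col_mean + d.mean()

def distance_covariance(x, y):
    """Empirical (squared) Brownian distance covariance of paired samples.

    x: (n, p) array, y: (n, q) array with n paired observations.
    Returns a non-negative scalar that vanishes (asymptotically) iff
    x and y are independent, so it captures nonlinear as well as
    linear dependence.
    """
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # pairwise distances within x
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)  # pairwise distances within y
    A, B = double_center(a), double_center(b)
    return (A * B).mean()

# Toy check: a quadratic relationship that Pearson/cosine similarity misses.
rng = np.random.default_rng(0)
x = rng.standard_normal((512, 1))
y = x ** 2                                       # dependent, yet linearly uncorrelated
print(distance_covariance(x, y))                 # clearly positive
print(np.corrcoef(x.ravel(), y.ravel())[0, 1])   # close to zero
```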
Primary Area: General Machine Learning->Representation Learning
Keywords: Action recognition, video-language alignment, Brownian distance covariance, CLIP
Submission Number: 859