ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

TMLR Paper 5596 Authors

10 Aug 2025 (modified: 19 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image–language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image–language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video–text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video–language models while using $\sim 8\times$ fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image–language models for fine-grained video understanding.
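To make the alignment step concrete, below is a minimal sketch of the DTW scoring described in the abstract, assuming precomputed L2-normalized image–language frame embeddings and text embeddings of the LLM-generated sub-action script for each class. The function names, the step pattern, and the path-length normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dtw_alignment_score(frame_embs: np.ndarray, subaction_embs: np.ndarray) -> float:
    """Mean cosine similarity along the optimal monotonic DTW path.

    frame_embs:     (T, d) L2-normalized frame embeddings.
    subaction_embs: (K, d) L2-normalized embeddings of the K ordered
                    sub-actions generated by the LLM for one class.
    """
    sim = frame_embs @ subaction_embs.T        # (T, K) cosine similarities
    T, K = sim.shape

    # DTW minimizes cost, so align over negative similarity.
    cost = -sim
    acc = np.full((T + 1, K + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, K + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # next frame, same sub-action
                acc[i, j - 1],      # same frame, next sub-action
                acc[i - 1, j - 1],  # advance both
            )

    # Backtrack to count path cells, then normalize total similarity by path length.
    i, j, steps = T, K, 1
    while (i, j) != (1, 1):
        if i == 1:
            j -= 1
        elif j == 1:
            i -= 1
        else:
            prev = {(i - 1, j - 1): acc[i - 1, j - 1],
                    (i - 1, j): acc[i - 1, j],
                    (i, j - 1): acc[i, j - 1]}
            i, j = min(prev, key=prev.get)
        steps += 1
    return -acc[T, K] / steps

def classify(frame_embs: np.ndarray, class_scripts: dict) -> str:
    """Pick the class whose sub-action script aligns best with the video."""
    scores = {name: dtw_alignment_score(frame_embs, script)
              for name, script in class_scripts.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Random placeholders stand in for real CLIP/SigLIP embeddings.
    rng = np.random.default_rng(0)
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    frames = unit(rng.normal(size=(32, 512)))            # 32 sampled frames
    scripts = {
        "windmill dunk": unit(rng.normal(size=(5, 512))),  # 5-step script
        "alley-oop":     unit(rng.normal(size=(4, 512))),  # 4-step script
    }
    print(classify(frames, scripts))
```

Because DTW enforces a monotonic, order-preserving match, each class is scored by how well its scripted progression of sub-actions explains the frame sequence, with no video–text training required.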
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Liang-Chieh_Chen1
Submission Number: 5596