Motion-Focused Tokenization for Source-Free Video Domain Adaptation

ICML 2025 Workshop TokShop Submission 23

Published: 10 Jun 2025, Last Modified: 18 Jun 2025
License: CC BY 4.0
Archiving Submission: No (non-archival)
Previous Venue If Non Archival: N/A
Keywords: Video tokenization, Domain adaptation, Action recognition
Abstract:

Source-free video unsupervised domain adaptation (SFVUDA) is a significant challenge in action recognition: a model pretrained on a labeled source domain must be adapted to an unlabeled target domain under the constraint that the source data is inaccessible during adaptation. Despite advances in SFVUDA, existing approaches still fall significantly short of supervised performance. We argue that a key reason for this bottleneck is the presence of variable static backgrounds in videos, which contribute substantially to the domain shift. To address this, we propose Motion-Focused Tokenization (MFT) for SFVUDA. MFT first tokenizes source and target video frames into patch tokens, then suppresses the low-motion tokens, which largely belong to the background, while retaining the motion-rich tokens corresponding to the action for domain adaptation. Integrating MFT into the best-performing existing SFVUDA method yields a significant improvement ($\sim$2%) across two popular domain adaptation (DA) benchmarks, Daily-DA and UCF-HMDB, covering 15 different DA settings.
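The core MFT idea described above (patch tokenization followed by suppression of low-motion tokens) can be illustrated with a minimal sketch. The sketch below assumes motion is approximated by per-patch temporal differencing and that a fixed fraction of tokens is retained; the function names, the `keep_ratio` parameter, and the frame-differencing motion proxy are illustrative assumptions, not details taken from the paper.

```python
import torch


def patchify(frames: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split frames of shape (T, C, H, W) into patch tokens (T, N, C*patch*patch)."""
    T, C, H, W = frames.shape
    # Unfold height then width: (T, C, H/p, W/p, p, p)
    tokens = frames.unfold(2, patch, patch).unfold(3, patch, patch)
    # Group spatial locations as tokens: (T, N, C*p*p) with N = (H/p)*(W/p)
    tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * patch * patch)
    return tokens


def motion_token_filter(frames: torch.Tensor, patch: int = 16,
                        keep_ratio: float = 0.5):
    """Keep the most motion-rich patch tokens, suppressing static background.

    Motion per patch location is approximated by the mean absolute
    temporal difference of its pixel values (a crude optical-flow proxy;
    an assumption of this sketch, not the paper's motion estimator).
    """
    tokens = patchify(frames, patch)                    # (T, N, D)
    diffs = (tokens[1:] - tokens[:-1]).abs().mean(-1)   # (T-1, N) per-frame motion
    motion = diffs.mean(0)                              # (N,) score per patch location
    k = max(1, int(keep_ratio * motion.numel()))
    keep = motion.topk(k).indices                       # indices of motion-rich patches
    return tokens[:, keep], keep


# Example: a clip of 8 RGB frames at 224x224 with 16x16 patches gives
# 196 tokens per frame; keep_ratio=0.5 retains the 98 most motion-rich.
clip = torch.randn(8, 3, 224, 224)
kept_tokens, kept_idx = motion_token_filter(clip)
print(kept_tokens.shape)  # torch.Size([8, 98, 768])
```

Under this sketch, the retained tokens (and their indices, which would be needed for positional embeddings) could then be fed to any downstream SFVUDA method in place of the full token set.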

Submission Number: 23