Source-free video unsupervised domain adaptation (SFVUDA) represents a significant challenge in action recognition research. It requires adapting a pretrained model from a labeled source domain to an unlabeled target domain, with the constraint that source data remains inaccessible during adaptation. Despite advances in SFVUDA approaches, their performance remains significantly inferior to that of supervised approaches. We argue that a key reason for this performance bottleneck is the presence of variable static backgrounds in videos, which contribute substantially to domain shifts. To address this, we propose Motion-Focused Tokenization (MFT) for SFVUDA. In MFT, we first tokenize source and target video frames into patch tokens, then suppress the low-motion tokens, which largely belong to the background, while retaining the motion-rich tokens corresponding to actions for domain adaptation. Introducing MFT into the best-performing existing SFVUDA method yields a significant improvement ($\sim$2%) across two popular domain adaptation (DA) benchmarks, Daily-DA and UCF-HMDB, covering 15 different DA settings.
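To make the token-selection idea concrete, below is a minimal sketch of how motion-focused token filtering could work. It is not the paper's implementation: the function name `motion_focused_tokenize`, the `keep_ratio` parameter, and the use of simple frame differencing as the motion cue are all illustrative assumptions (the actual MFT method may rely on a different motion estimate, e.g. optical flow or attention statistics).

```python
import torch
import torch.nn.functional as F

def motion_focused_tokenize(frames, patch_size=16, keep_ratio=0.5):
    """Illustrative sketch (not the paper's code) of motion-focused
    token selection.

    frames: (T, C, H, W) video clip tensor.
    Returns, for each consecutive frame pair, the indices of the
    motion-rich patch tokens to retain for adaptation.
    """
    # Motion proxy (assumption): absolute difference between
    # consecutive frames; static background patches score low.
    diff = (frames[1:] - frames[:-1]).abs()            # (T-1, C, H, W)

    # Average motion energy per non-overlapping patch.
    motion = F.avg_pool2d(
        diff.mean(dim=1, keepdim=True), kernel_size=patch_size
    )                                                  # (T-1, 1, H/p, W/p)
    scores = motion.flatten(1)                         # (T-1, num_patches)

    # Retain the top-k motion-rich tokens; low-motion (background)
    # tokens are suppressed before domain adaptation.
    k = max(1, int(keep_ratio * scores.shape[1]))
    keep_idx = scores.topk(k, dim=1).indices           # (T-1, k)
    return keep_idx

# Example: an 8-frame clip at 224x224 yields 196 patch tokens per
# frame pair, of which the 98 highest-motion tokens are kept.
clip = torch.rand(8, 3, 224, 224)
kept = motion_focused_tokenize(clip)                   # shape (7, 98)
```

Frame differencing is the cheapest possible motion cue; its appeal here is that it needs no extra model, but it conflates camera motion with action motion, which is one reason a real method might prefer a more robust motion estimate.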