Track: regular paper (up to 6 pages)
Keywords: Instructional Video Understanding, Action Recognition, Bias, Spurious Correlation
TL;DR: Ordinal bias leads action recognition models to over-rely on dominant action pairs, inflating performance and lacking true video comprehension even when challenged by action masking and sequence shuffling.
Abstract: Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 32
Loading