Track: regular paper (up to 6 pages)
Keywords: Instructional Video Understanding, Bias, Spurious Correlation
TL;DR: Ordinal bias leads action recognition models to over-rely on dominant action pairs, which inflates their performance; even when challenged by action masking and sequence shuffling, the models still lack true video comprehension.
Abstract: Action recognition models have shown promising results in understanding consecutive human actions in instructional videos. However, they often rely on dominant action patterns in datasets rather than achieving true video comprehension. We define this as ordinal bias, a systematic reliance on dataset-specific action sequences. To mitigate this, we introduce two simple yet effective video manipulation techniques: action masking, which masks the latter action in dominant action pairs, and sequence shuffling, which randomizes the order of actions. Our findings reveal that existing models still tend to rely on dominant action pairs and struggle to adapt, highlighting their overestimated performance and lack of robustness.
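The two manipulations described in the abstract can be illustrated at the level of action labels. The sketch below is a hypothetical illustration, not the authors' implementation, and it rests on several assumptions: each video is represented as a list of per-segment action labels, a "dominant pair" is taken to be one of the most frequent consecutive (a, b) label pairs across the dataset, and masking replaces the latter label with a placeholder token (the abstract does not say how frames are masked in practice). All function names, the `top_k` parameter, the `<MASK>` token, and the example labels are invented for illustration.

```python
import random
from collections import Counter


def dominant_pairs(sequences, top_k=1):
    """Count consecutive (a, b) action pairs over all videos and return the
    top_k most frequent ones (assumed definition of a 'dominant pair')."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return [pair for pair, _ in counts.most_common(top_k)]


def action_mask(seq, pairs, mask_token="<MASK>"):
    """Action masking (sketch): wherever a dominant pair (a, b) occurs in the
    original sequence, replace the latter action b with a placeholder token."""
    pairs = set(pairs)
    out = list(seq)
    for i in range(1, len(seq)):
        if (seq[i - 1], seq[i]) in pairs:
            out[i] = mask_token
    return out


def sequence_shuffle(seq, seed=None):
    """Sequence shuffling (sketch): return the action segments in a random
    order without modifying the input list."""
    rng = random.Random(seed)
    out = list(seq)
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    # Hypothetical per-video action label sequences.
    videos = [
        ["crack_egg", "whisk", "fry"],
        ["crack_egg", "whisk", "serve"],
        ["crack_egg", "whisk", "fry"],
    ]
    pairs = dominant_pairs(videos, top_k=1)
    print(pairs)
    print(action_mask(videos[0], pairs))
    print(sequence_shuffle(videos[0], seed=0))
```

In this toy data, ("crack_egg", "whisk") occurs in every video, so it is selected as the dominant pair and "whisk" is masked in the first video. A model that still predicts "whisk" in the masked slot would be relying on the pair's frequency rather than on the visual content, which is the kind of ordinal bias the abstract describes.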
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Presenter: ~Joochan_Kim1
Submission Number: 32