Leveraging Prediction Inconsistency for Online Error Detection in Procedural Videos

02 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: video understanding, procedural videos, egocentric videos, online error detection, real time, action detection
Abstract: An efficient and accurate system for detecting errors in procedural tasks is crucial for supporting people in daily life, as it can provide instant notifications and guide them to correct mistakes. In this paper, we address the challenge of real-time, online error detection in procedural task videos by leveraging inconsistencies in action detector predictions. We propose a DUal-Branch Action Detector (DUBAD) framework, which integrates both \textit{robust} and \textit{sensitive} action detectors. The \textit{robust} action detectors generate accurate and stable action predictions, while the \textit{sensitive} detectors produce inconsistent predictions when errors occur. To achieve this, we design a temporal-aware dynamic weight module that enhances sensitivity to errors using affine transformations with input-dependent, constrained weights and biases. Furthermore, we train the action detectors with varying amounts of temporal information to amplify prediction inconsistencies when action sequences deviate from the correct order. For videos containing multiple or diverse errors, we apply a majority voting scheme based on mismatches between robust and sensitive predictions. Extensive experiments on EgoPER, Assembly-101-O, and EPIC-Tent-O demonstrate that our method outperforms state-of-the-art approaches in online error detection, while maintaining real-time efficiency with a lightweight architecture.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 865
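
Illustrative sketch (not from the paper): the abstract describes a majority voting scheme over mismatches between robust and sensitive predictions but does not spell out its exact form. The Python below shows one plausible reading, in which a frame is flagged as an error when a strict majority of sensitive detectors disagree with the robust prediction. The function name `detect_errors_by_voting`, the integer action labels, and the strict-majority threshold are all assumptions made for illustration.

```python
from typing import List, Sequence


def detect_errors_by_voting(
    robust: Sequence[int],
    sensitive: Sequence[Sequence[int]],
) -> List[bool]:
    """Flag frames where most sensitive detectors disagree with the robust one.

    robust:    per-frame action labels from a single robust detector (assumed).
    sensitive: one per-frame label sequence per sensitive detector (assumed).
    Returns one boolean per frame; True marks a suspected error.
    """
    if not sensitive:
        raise ValueError("at least one sensitive detector is required")
    num_frames = len(robust)
    for preds in sensitive:
        if len(preds) != num_frames:
            raise ValueError("all prediction sequences must have equal length")

    flags = []
    for t in range(num_frames):
        # Count sensitive detectors whose label differs from the robust label.
        mismatches = sum(1 for preds in sensitive if preds[t] != robust[t])
        # Strict majority: more than half of the sensitive detectors disagree.
        flags.append(2 * mismatches > len(sensitive))
    return flags


if __name__ == "__main__":
    robust_preds = [0, 0, 1, 1]
    sensitive_preds = [
        [0, 0, 1, 2],
        [0, 1, 1, 2],
        [0, 0, 2, 2],
    ]
    print(detect_errors_by_voting(robust_preds, sensitive_preds))
    # prints [False, False, False, True]
```

Because each frame is decided only from predictions available at that frame, a rule of this shape is compatible with the online setting the abstract targets; the actual method may weight or aggregate mismatches differently.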