Keywords: visual question answering, multimodality, reasoning, cross-modal application, physical commonsense knowledge
TL;DR: We extend the problem of procedural mistake detection to require the generation of visual self-dialog rationales, leveraging the physical and procedural commonsense knowledge in foundational vision-and-language models.
Abstract: Procedural mistake detection (PMD) is the challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the knowledge and reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that while VLMs struggle off-the-shelf, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods, though not without tradeoffs. We also observe that these metrics make common outcomes, such as knowledge deficiencies, readily visible.
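The abstract describes scoring rationale coherence with an off-the-shelf NLI model. The sketch below is not the authors' implementation; it only illustrates, under assumptions, how an entailment probability between a generated rationale and the final mistake/success decision could be computed with a generic MNLI-style checkpoint (the model name, example strings, and label handling are all assumptions).

```python
# Minimal sketch: NLI-based coherence scoring of a generated rationale.
# The checkpoint name and example texts are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any MNLI-style NLI checkpoint could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze(0)
    # Read the entailment index from the model config rather than hard-coding it.
    entail_idx = model.config.label2id.get("ENTAILMENT", int(probs.argmax()))
    return probs[entail_idx].item()

# Hypothetical usage: does the self-dialog rationale support the final decision?
rationale = "The user has placed the filter in the machine and added ground coffee."
decision = "The user has completed the current step of the recipe successfully."
print(f"coherence (entailment) = {entailment_score(rationale, decision):.3f}")
```

A metric along these lines could be aggregated over the question-answer pairs in a self-dialog, but the exact formulation of the paper's two coherence metrics is not specified here.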
Archival Status: Non-archival (not included in proceedings)
Submission Number: 13