PedGraph: Resolving Pragmatic Ambiguity in Instructional Videos through Function-Aware Event Detection
Keywords: vision language navigation, event extraction, cross-modal information extraction, multimodality, educational applications, knowledge graphs
Abstract: Understanding classroom instruction requires not only localizing what happens in a video, but also inferring \emph{why} it happens: visually similar behaviors (e.g., pointing) can serve different pedagogical functions under different discourse phases, creating profound pragmatic ambiguity. Yet existing video and vision-language models excel mainly at appearance-driven recognition and often lack an explicit representation of the relational logic that governs instructional interaction. To address this gap, we introduce \textbf{PedGraph}, a knowledge-guided framework that integrates a data-driven, expert-validated \textbf{S}tructured \textbf{T}eaching \textbf{I}nteraction \textbf{G}raph (\textbf{STIG}) to represent hierarchical, multi-relational pedagogical context. PedGraph injects STIG topology into representation learning via a structure-aware contrastive objective and performs global inference with a hierarchical relation-aware graph network to disambiguate event functions. We evaluate on \textbf{PEA}, a densely annotated instructional video benchmark (15.2 hours, 113 lessons, 32 event classes), where PedGraph outperforms strong baselines by 3.4 points in mAP@0.5 on function-aware event detection. Code, models, and the dataset will be released at \url{https://anonymous.4open.science/r/event-3A66/}.
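To make the abstract's structure-aware contrastive objective concrete, the sketch below shows one plausible instantiation: a soft-positive InfoNCE loss in which pairs of event embeddings are weighted by how close their classes sit in a pedagogical interaction graph such as STIG. This is a minimal illustration, not the paper's implementation; the function name, the exponential distance-to-weight mapping, and the temperature value are all assumptions.

```python
import torch
import torch.nn.functional as F

def structure_aware_contrastive_loss(embeddings, labels, graph_dist, temperature=0.1):
    """Hypothetical sketch of a structure-aware contrastive objective.

    embeddings: (N, D) event embeddings from the video encoder.
    labels:     (N,)  event-class indices.
    graph_dist: (C, C) pairwise shortest-path distances between event classes
                in a pedagogical interaction graph (e.g., STIG).
    Pairs whose classes are close in the graph act as soft positives;
    pairs from topologically distant classes are pushed apart.
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                      # (N, N) scaled cosine similarities

    # Map class-level graph distance to a pairwise weight in (0, 1]:
    # same or adjacent classes get high weight, distant classes get ~0.
    d = graph_dist[labels][:, labels].float()          # (N, N)
    weight = torch.exp(-d)
    weight.fill_diagonal_(0.0)                         # ignore self-pairs

    # Log-probability of each pair under an InfoNCE-style denominator
    # that excludes the sample itself.
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True
    )

    # Weighted soft-positive InfoNCE: pull together samples whose classes
    # are topologically close in the graph.
    loss = -(weight * log_prob).sum(1) / weight.sum(1).clamp_min(1e-8)
    return loss.mean()
```

Under these assumptions, the loss reduces to standard supervised contrastive learning when the graph distance is 0 for same-class pairs and infinite otherwise, and interpolates smoothly as graph neighbors are allowed to share representational structure.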
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, event extraction, cross-modal information extraction, multimodality, educational applications, knowledge graphs
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Chinese
Submission Number: 8889