Semantic Event Graphs for Long-Form Video Question Answering

Published: 28 Dec 2025, Last Modified: 08 Mar 2026, AAAI 2026 Bridge: LM Reasoning, CC BY 4.0
Keywords: long-form video QA, temporal reasoning, symbolic representations, scene graphs, vision-language models, token efficiency
Abstract: Long-form video question answering remains challenging for modern vision–language models, which struggle to reason over hour-scale footage without exceeding practical token and compute budgets. Existing systems typically downsample frames or feed dense visual embeddings to large-context language models, trading off temporal coverage against cost. We propose Semantic Event Graphs (SEG), a lightweight symbolic interface between video and language that replaces raw frames with compact temporal interaction logs. Our pipeline detects and tracks objects with YOLOv11, converts proximity patterns into START/END human–object events, and organizes them into a Temporal Scene Graph (TSG). At inference time, a query-aware pruning module identifies anchor entities and lexically relevant events, returning only a small subgraph which is verbalized and passed to Gemini 2.5 Flash for answer generation. On five YouTube videos (300–500 interactions each) and 120 automatically generated long-horizon questions, SEG achieves 65.0% accuracy using only 3.47k tokens per query, closely matching a full-log baseline (62.5% at 40.39k tokens) while reducing token usage by 91.4%. A short-context baseline restricted to the last 30 seconds collapses to 2.5% accuracy, underscoring the need for explicit temporal memory. These results show that symbolic temporal graphs can serve as an effective, plug-and-play memory layer for off-the-shelf vision–language models, preserving long-range reasoning ability while making long-form video question answering substantially more token- and cost-efficient. Code, logs, and event-extraction tools will be released for reproducibility.
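The core pipeline the abstract describes (proximity patterns → START/END events → query-aware pruning → verbalized subgraph) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the `Event` fields, the pixel-distance threshold, and the lexical-overlap heuristic in `prune` are all assumptions for exposition.

```python
# Hypothetical sketch of SEG-style event extraction and query-aware pruning.
# Thresholds, field names, and the overlap heuristic are illustrative only.
import re
from dataclasses import dataclass

@dataclass
class Event:
    t: float     # timestamp in seconds
    kind: str    # "START" or "END" of a human-object interaction
    person: str  # tracked person id (e.g. from a YOLO + tracker pipeline)
    obj: str     # object label, e.g. "cup"

def extract_events(distances, person, obj, thresh=50.0):
    """Turn per-frame person-object distances into START/END interaction events.

    `distances` is a list of (timestamp, pixel_distance) pairs; an interaction
    starts when the distance drops below `thresh` and ends when it rises again.
    """
    events, inside = [], False
    for t, d in distances:
        if d < thresh and not inside:
            events.append(Event(t, "START", person, obj))
            inside = True
        elif d >= thresh and inside:
            events.append(Event(t, "END", person, obj))
            inside = False
    return events

def prune(events, query):
    """Keep only events whose object label overlaps lexically with the query."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    return [e for e in events if e.obj.lower() in q_tokens]

def verbalize(events):
    """Render the pruned subgraph as a compact text log for the LLM prompt."""
    return "\n".join(f"[{e.t:.1f}s] {e.person} {e.kind} {e.obj}" for e in events)

log = extract_events(
    [(0.0, 80.0), (1.0, 30.0), (2.0, 25.0), (3.0, 90.0)], "person_1", "cup")
print(verbalize(prune(log, "When did the person pick up the cup?")))
# prints:
# [1.0s] person_1 START cup
# [3.0s] person_1 END cup
```

Only the verbalized, pruned log reaches the language model, which is how the token budget stays small relative to passing the full event log or raw frames.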
Submission Number: 106