Orion: A Fully Deterministic and Interpretable Pipeline for Video Scene Graph Generation with Explicit Causal Influence Scoring

Published: 02 Mar 2026, Last Modified: 05 Mar 2026
Venue: ES-Reasoning @ ICLR 2026
License: CC BY 4.0
Keywords: Video Scene Graph Generation and Anticipation
Abstract: Understanding interaction dynamics in egocentric video remains a fundamental challenge for grounded vision-language systems. Existing approaches detect objects and actions but often fail to synthesize them into temporally consistent, interpretable relational representations. We introduce Orion, a fully deterministic and interpretable pipeline that transforms raw perceptual streams into symbolic, queryable knowledge graphs through a process termed semantic uplift. Semantic uplift converts low-level detections, embeddings, and tracks into structured entities, relations, and influence-aware events. Orion integrates modular perception components: DINO-based backbones for object proposals, V-JEPA2 for appearance representation and re-identification, and a lightweight FastVLM-style backend for natural language entity descriptions. Entities and relations are assembled into temporally aligned scene graphs. An explicit Causal Influence Score (CIS) deterministically aggregates temporal, spatial, motion, and semantic evidence to estimate directed influence between entity pairs, enabling transparent and auditable reasoning about interaction patterns. Orion positions semantic uplift as a bridge between low-level vision outputs and high-level relational representations while remaining fully interpretable.
Submission Number: 39
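
The abstract says that the CIS deterministically aggregates temporal, spatial, motion, and semantic evidence, but it does not give the formula. As a purely illustrative sketch, one simple deterministic aggregation is a normalized weighted sum of four evidence terms. Everything below is an assumption and not taken from the paper: the `Evidence` type, the `causal_influence_score` function, the equal weights, and the convention that each term lies in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical per-pair evidence for a directed pair (source -> target).

    Each field is assumed to be pre-normalized to [0, 1]; the paper's
    actual feature definitions are not given in the abstract.
    """
    temporal: float
    spatial: float
    motion: float
    semantic: float


# Hypothetical equal weights; the real weighting is not stated in the abstract.
WEIGHTS = {"temporal": 0.25, "spatial": 0.25, "motion": 0.25, "semantic": 0.25}


def causal_influence_score(ev: Evidence, weights: dict = WEIGHTS) -> float:
    """Return a normalized weighted sum of the four evidence terms.

    The function has no randomness or hidden state, so the same inputs
    always produce the same score.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for name, weight in weights.items():
        value = getattr(ev, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} evidence must lie in [0, 1]")
        score += weight * value
    return score / total
```

Because the score is a plain sum of named terms, each contribution can be read off directly, which is one way the auditability described in the abstract could be realized. Directedness would come from how the evidence is computed for an ordered pair, which this sketch leaves outside its scope.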