[Tiny Paper] GEST-Engine: Controllable Multi-Actor Video Synthesis with Perfect Spatiotemporal Annotations

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop on World Models · CC BY 4.0
Keywords: Synthetic Video Generation, Video Generation, World Model, Video Segmentation, Video Scene Graph, Simulation Environment
TL;DR: We present a system that orchestrates multiple actors through time and produces ground-truth visual, spatial, and temporal annotations for synthetically generated videos.
Abstract: The world is a complex and dynamic place in which multiple concurrent events constantly unfold between entities such as people and objects. While large-scale datasets with annotations obtained manually or through automatic post-processing of videos exist and facilitate the training of world models, few of them capture this complexity with ground-truth annotations. In this paper, we introduce GEST-Engine, a system that, given a Graph of Events in Space and Time (GEST) --- a specification encompassing entities, events, and temporal constraints --- uses game-world simulation to generate, in a controlled manner, complex multi-actor and multi-object videos with pixel-level ground-truth annotations, frame-synchronized temporal segments, cross-actor temporal relations, and cross-entity spatial relations, all reverse-mapped to the initial specification. We describe our complete end-to-end workflow, which encompasses random GEST generation, a scalable pipeline for artifact generation and collection, and a sample corpus of 398 multi-actor videos spanning 37 action types with dense annotations at zero marginal cost.
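To make the abstract concrete, here is a minimal, hypothetical sketch of what a GEST-style specification might look like as plain data structures. All field names, relation labels, and the `GEST`/`Event` classes are illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical GEST-style specification: entities, events, and
# temporal constraints as plain Python data structures.
from dataclasses import dataclass, field


@dataclass
class Event:
    actor: str
    action: str
    start: float  # seconds
    end: float


@dataclass
class GEST:
    entities: list
    events: list = field(default_factory=list)

    def temporal_relation(self, a: int, b: int) -> str:
        """Derive a coarse Allen-style relation between two events by index."""
        e1, e2 = self.events[a], self.events[b]
        if e1.end <= e2.start:
            return "before"
        if e2.end <= e1.start:
            return "after"
        return "overlaps"


spec = GEST(entities=["actor_1", "actor_2", "chair"])
spec.events.append(Event("actor_1", "sit_down", start=0.0, end=2.5))
spec.events.append(Event("actor_2", "walk_to", start=2.5, end=5.0))
print(spec.temporal_relation(0, 1))  # "before"
```

A generator like the one described in the abstract could sample such graphs at random, hand them to the simulation engine for rendering, and then map the resulting per-frame annotations back onto the event indices.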
Supplementary Material: zip
Submission Number: 107