[Tiny Paper] GEST-Engine: Controllable Multi-Actor Video Synthesis with Perfect Spatiotemporal Annotations

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop on World Models · CC BY 4.0
Keywords: Synthetic Video Generation, Video Generation, World Model, Video Segmentation, Video Scene Graph, Simulation Environment
TL;DR: We present a system that orchestrates multiple actors through time and produces ground-truth visual, spatial, and temporal annotations for synthetically generated videos.
Abstract: The world is a complex and dynamic place in which multiple concurrent events constantly unfold between entities such as people and objects. While large-scale datasets with annotations obtained manually or through automatic post-processing of videos exist and facilitate the training of world models, few of them capture this complexity with ground-truth annotations. In this paper, we introduce GEST-Engine, a system that, given a Graph of Events in Space and Time (GEST) --- a specification encompassing entities, events, and temporal constraints --- uses game-world simulation to generate, in a controlled manner, complex multi-actor and multi-object videos with pixel-level ground-truth annotations, frame-synchronized temporal segments, cross-actor temporal relations, and cross-entity spatial relations, all reverse-mapped to the initial specification. We describe our complete end-to-end workflow, which encompasses random GEST generation, a scalable pipeline for artifact generation and collection, and a sample corpus of 398 multi-actor videos spanning 37 action types with dense annotations at zero marginal cost.
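To make the abstract concrete, here is a minimal, hypothetical sketch of what a GEST-style specification might look like as plain data structures. All field names, relation labels, and the `GEST`/`Event` classes are illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical GEST-style specification: entities, events, and
# temporal constraints as plain Python data structures.
from dataclasses import dataclass, field


@dataclass
class Event:
    actor: str
    action: str
    start: float  # seconds
    end: float


@dataclass
class GEST:
    entities: list
    events: list = field(default_factory=list)

    def temporal_relation(self, a: int, b: int) -> str:
        """Derive a coarse Allen-style relation between two events by index."""
        e1, e2 = self.events[a], self.events[b]
        if e1.end <= e2.start:
            return "before"
        if e2.end <= e1.start:
            return "after"
        return "overlaps"


spec = GEST(entities=["actor_1", "actor_2", "chair"])
spec.events.append(Event("actor_1", "sit_down", start=0.0, end=2.5))
spec.events.append(Event("actor_2", "walk_to", start=2.5, end=5.0))
print(spec.temporal_relation(0, 1))  # "before"
```

A generator like the one described in the abstract could sample such graphs at random, hand them to the simulation engine for rendering, and then map the resulting per-frame annotations back onto the event indices.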
Supplementary Material: zip
Submission Number: 107