The N-Body Problem: Predicting Parallel Execution from Single-Person Egocentric Video

10 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Egocentric Video, Activity Understanding, Video Understanding
TL;DR: Given a single video, we predict how all tasks in that video can be carried out by multiple collaborating agents, allowing parallel task execution, using a VLM with prompts that encode the goals and constraints.
Abstract: Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how can N individuals together perform the same set of tasks observed in this video? The goal is to maximise speed-up, but naive task allocation often violates real-world constraints, leading to physically impossible scenarios such as two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously reducing collision rates by 55% and object conflicts by 45%.
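The metrics named in the abstract can be sketched concretely. The following is a minimal illustration, not the authors' code: the `Step` layout, function names, and the pairwise overlap check for object conflicts are all assumptions about how a parallel schedule might be scored; the paper's actual formalisation may differ.

```python
# Hypothetical sketch of N-Body Problem evaluation metrics.
# Data layout and function names are assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class Step:
    agent: int    # which of the N agents performs the step
    start: float  # start time in seconds
    end: float    # end time in seconds
    action: str   # action label from the source video
    obj: str      # primary object used by the step

def speed_up(serial_duration: float, schedule: list[Step]) -> float:
    """Serial runtime divided by the parallel makespan."""
    makespan = max(s.end for s in schedule)
    return serial_duration / makespan

def action_coverage(observed: set[str], schedule: list[Step]) -> float:
    """Fraction of observed actions that appear somewhere in the plan."""
    planned = {s.action for s in schedule}
    return len(observed & planned) / len(observed)

def object_conflicts(schedule: list[Step]) -> int:
    """Count pairs of temporally overlapping steps where two different
    agents use the same object (one of the infeasibility modes above)."""
    conflicts = 0
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            same_object = a.agent != b.agent and a.obj == b.obj
            overlap = a.start < b.end and b.start < a.end
            if same_object and overlap:
                conflicts += 1
    return conflicts
```

For example, a 10-second serial video split into two 5-second steps on disjoint objects yields a speed-up of 2.0 with zero conflicts, whereas two agents holding the same knife at overlapping times registers an object conflict. A spatial-collision metric would follow the same pairwise pattern with agent positions instead of objects.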
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3821