Keywords: Egocentric Video, Activity Understanding, Video Understanding
TL;DR: Given a single video, we use a VLM with prompts that encode the goals and constraints to predict how all tasks in that video can be carried out in parallel by multiple collaborating agents.
Abstract: Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can together perform the same set of tasks observed in the video. The goal is to maximise speed-up, but naive task allocation often violates real-world constraints, leading to physically impossible scenarios such as two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while reducing collision rates by 55% and object conflicts by 45%.
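A minimal sketch of how the two performance metrics named in the abstract might be computed; the function names, inputs, and exact metric definitions here are illustrative assumptions, not the paper's formal definitions.

def speed_up(single_agent_duration, parallel_makespan):
    # Ratio of the one-person runtime to the N-agent makespan (higher is better).
    return single_agent_duration / parallel_makespan

def task_coverage(planned_actions, observed_actions):
    # Fraction of actions observed in the video that appear in the parallel plan.
    return len(set(planned_actions) & set(observed_actions)) / len(set(observed_actions))

# Example: a 10-minute video re-planned for N = 2 with a 6-minute makespan,
# covering 18 of 20 observed actions.
print(speed_up(600, 360))                   # ~1.67x speed-up
print(task_coverage(range(18), range(20)))  # 0.9 coverage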
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3821