Keywords: Video Generation, Human-Centric Generation, Affordance, Representation Visualization
TL;DR: We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction.
Abstract: Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene while ensuring coherent behavior, realistic appearance, harmonization with the scene, and adherence to scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions such as bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that the inherent affordance perception of a pre-trained video model can be uncovered without labeled affordance datasets.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10069