Ask, Pose, Unite: Scaling Data Acquisition for Close Interaction Meshes with Vision Language Models

Published: 06 May 2025 (last modified: 06 May 2025) · SynData4CV · CC BY 4.0
Keywords: human mesh estimation, dataset generation, close human interactions, weak supervision, vision language model
TL;DR: We propose a data generation method with LVLMs for producing diverse 3D meshes from monocular images of closely interacting people
Abstract: Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method, Ask, Pose, Unite (APU), which utilizes Large Vision Language Models (LVLMs) to annotate contact maps that guide test-time optimization. APU produces paired images and pseudo-ground-truth meshes from monocular images. Our method not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored to close interactions in HME. Our dataset, comprising over 6.2k human mesh pairs in contact and covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using data from APU improves mesh estimation on unseen interactions when training a diffusion-based contact prior. Our work addresses the longstanding challenge of data scarcity for close interactions in HME, enhancing the field's ability to handle complex interaction scenarios. Our code, models, and data will be made publicly available upon acceptance.
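To make the contact-map-guided optimization concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of the core idea: given vertex-pair contact annotations for two people, a test-time objective pulls the annotated vertices together. For simplicity this toy version optimizes only a rigid translation of the second mesh by gradient descent; the function names and interface are illustrative assumptions.

```python
import numpy as np

def contact_loss(verts_a, verts_b, contact_pairs):
    """Sum of squared distances between vertex pairs annotated as in contact.

    verts_a, verts_b: (N, 3) and (M, 3) vertex arrays for the two people.
    contact_pairs:    (K, 2) integer array; row (i, j) says vertex i of
                      person A touches vertex j of person B.
    """
    diffs = verts_a[contact_pairs[:, 0]] - verts_b[contact_pairs[:, 1]]
    return float(np.sum(diffs ** 2))

def optimize_contact(verts_a, verts_b, contact_pairs, lr=0.1, steps=200):
    """Toy test-time optimization: find a translation t for mesh B that
    brings its annotated contact vertices onto those of mesh A, by
    gradient descent on the squared-distance contact loss."""
    t = np.zeros(3)
    for _ in range(steps):
        moved = verts_b + t
        # Gradient of sum ||(b_j + t) - a_i||^2 with respect to t
        diffs = moved[contact_pairs[:, 1]] - verts_a[contact_pairs[:, 0]]
        grad = 2.0 * diffs.sum(axis=0)
        t -= lr * grad / len(contact_pairs)
    return t
```

In the full method, the LVLM supplies the contact map from the image, and the optimization acts on body-model parameters (pose, shape, global placement) rather than a single translation, alongside standard image-reprojection terms.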
Submission Number: 7
