Deformable Contact-Aware 3D Object Placement

ICLR 2026 Conference Submission 20686 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: 3D Vision
TL;DR: N/A
Abstract: We study language-guided object placement in real 3D scenes when contact is \emph{deformable and frictional}. Rather than guessing a rigid pose that “looks right,” we cast placement as a \emph{drop-to-equilibrium} problem: if the support, scale, and a reasonable pre-drop pose are provided, physics should determine where the object actually rests. Our pipeline, \textbf{DCAP}, couples language/vision priors with simulation. An LLM extracts the intended support and a realistic size prior; a minimal three-view VLM query returns a single rotation; and a sub-part–aware LLM selects the exact target region, after which we raycast to place the object 1 cm above it—no “upward-facing” constraint required. We assign per-part materials by \emph{hard} mapping of semantic labels to a curated library, split parts into rigid vs.\ MPM by stiffness, fill soft parts with particles, and then drop to equilibrium with a corotated-elastic MPM solver. To evaluate deformable placement, we convert 186 high-fidelity indoor scenes to watertight meshes by rendering multi-view images from InteriorGS and extracting surfaces with SuGaR. We score methods along two axes—\emph{Right Place} and \emph{Physics \& Naturalness}—using both a human-aligned VLM protocol and forced-choice human studies. DCAP substantially outperforms language-only and rigid-constraint baselines on both axes, produces visible, material-consistent deformations, and correctly flags infeasible instructions. Finally, using DCAP’s settled geometry as conditioning improves downstream 2D insertions, indicating that physically justified final states are valuable beyond simulation.
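The raycast step described in the abstract (drop a vertical ray onto the selected support region and release the object 1 cm above the hit) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes gravity along −z, represents the support as a list of triangles, and uses hypothetical function names (`ray_triangle`, `pre_drop_height`). Returning `None` when no support lies under the target loosely mirrors the paper's flagging of infeasible instructions.

```python
import numpy as np

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore ray/triangle intersection; returns hit distance t, or None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:          # ray parallel to triangle plane
        return None
    inv = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv
    return t if t > eps else None

def pre_drop_height(target_xy, support_triangles, clearance=0.01):
    """Cast a vertical ray down onto the support and return the release height:
    the highest hit plus a 1 cm clearance. None means no support under target."""
    origin = np.array([target_xy[0], target_xy[1], 1e3])
    down = np.array([0.0, 0.0, -1.0])
    best_z = None
    for v0, v1, v2 in support_triangles:
        t = ray_triangle(origin, down, v0, v1, v2)
        if t is not None:
            z = origin[2] - t                  # world z of the hit point
            if best_z is None or z > best_z:
                best_z = z                     # keep the topmost surface
    if best_z is None:
        return None  # instruction infeasible: nothing to rest on here
    return best_z + clearance
```

From this release height, the actual resting pose would then be determined by the drop-to-equilibrium simulation rather than by the raycast itself.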
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20686