SpatialHand: Generative Object Manipulation from 3D Perspective

Zehan Wang; Jialei Wang; Siyu Chen; Ziang Zhang; Luping Liu; Xize Cheng; Kaihang Pan; Hengshuang Zhao; Zhou Zhao

Published: 26 Jan 2026, Last Modified: 11 Feb 2026

Venue: ICLR 2026 Poster

License: CC BY 4.0

Keywords: AIGC Application; Image Editing

Abstract: We introduce SpatialHand, a novel framework for generative object insertion with precise 3D control. Current generative object manipulation methods operate primarily within the 2D image plane and often fail to capture 3D scene complexity, which leaves an object's 3D position, orientation, and occlusion relations ambiguous. SpatialHand addresses this by treating object insertion from a true "3D perspective," enabling manipulation with full 6 degrees-of-freedom (6DoF) control. Specifically, our solution implicitly encodes the 6DoF pose condition by decomposing it into 2D location (via a masked image), depth (via a composited depth map), and 3D orientation (embedded into latent features). To overcome the scarcity of paired training data, we develop an automated data construction pipeline that uses synthetic 3D assets, rendering, and subject-driven generation, complemented by visual foundation models for pose estimation. We further design a multi-stage training scheme that progressively teaches SpatialHand to robustly follow multiple complex conditions. Extensive experiments show that our approach outperforms existing alternatives and indicate its potential for more versatile and intuitive AR/VR-like object manipulation within images.
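
The three-part decomposition of the 6DoF pose condition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_pose_conditions`, the rectangular placement box, the constant object depth, and the sine/cosine angle encoding are all assumptions made purely for illustration, since the abstract does not specify how each condition is constructed.

```python
import numpy as np


def build_pose_conditions(image, scene_depth, box, object_depth, angles_deg):
    """Hypothetical sketch of splitting a 6DoF pose into three conditions.

    image:        (H, W, 3) float array, the background scene.
    scene_depth:  (H, W) float array, per-pixel depth of the scene.
    box:          (top, left, bottom, right), assumed 2D placement region.
    object_depth: scalar depth of the inserted object (assumed flat here).
    angles_deg:   (azimuth, elevation, roll), an assumed parametrization.
    """
    top, left, bottom, right = box
    mask = np.zeros(scene_depth.shape, dtype=bool)
    mask[top:bottom, left:right] = True

    # 2D location: blank out the target region of the image.
    masked_image = image.copy()
    masked_image[mask] = 0.0

    # Depth: the object replaces the scene depth only where it is closer,
    # so nearer scene content keeps occluding it.
    visible = mask & (object_depth < scene_depth)
    composited_depth = np.where(visible, object_depth, scene_depth)

    # 3D orientation: a simple sin/cos vector standing in for whatever
    # embedding is injected into the latent features.
    angles = np.deg2rad(np.asarray(angles_deg, dtype=np.float64))
    orientation = np.concatenate([np.sin(angles), np.cos(angles)])

    return masked_image, composited_depth, orientation, visible
```

In this sketch the composited depth map is what carries the occlusion relation: a scene pixel that is closer to the camera than the object keeps its own depth, so the object is marked as hidden there.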

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 16119
