ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models
Abstract: Grasp generation aims to create complex hand-object interactions with a specified object. While traditional approaches to hand generation have primarily focused on visibility and diversity under scene constraints, they tend to overlook fine-grained hand-object interactions such as contacts, resulting in inaccurate and undesired grasps. To address these challenges, we propose a controllable grasp generation task and introduce ClickDiff, a controllable conditional generation model that leverages a fine-grained Semantic Contact Map (SCM). When synthesizing interactive grasps, our method enables precise control of grasp synthesis through either user-specified or algorithmically predicted Semantic Contact Maps. Specifically, to make full use of contact supervision constraints and to accurately model the complex physical structure of hands, we propose a Dual Generation Framework. Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information, while the Contact Conditional Module utilizes contact maps alongside object point clouds to generate realistic grasps. We further establish evaluation criteria applicable to controllable grasp generation. Both unimanual and bimanual generation experiments on the GRAB and ARCTIC datasets validate our proposed method, demonstrating the efficacy and robustness of ClickDiff, even on previously unseen objects. Our code is available at https://anonymous.4open.science/r/ClickDiff.
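To make the Dual Generation Framework described above more concrete, the following is a minimal sketch of the two conditioning stages, assuming a PyTorch implementation; all class names, layer sizes, the hand-parameter dimension, and the click-to-label interface are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SemanticConditionalModule(nn.Module):
    """Predicts a per-point contact map on the object from click-style
    finger-part labels (hypothetical interface; names are illustrative)."""
    def __init__(self, num_parts=16, feat_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        self.part_embed = nn.Embedding(num_parts + 1, feat_dim)  # +1 for "no contact"
        self.head = nn.Sequential(nn.Linear(feat_dim * 2, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, obj_points, part_labels):
        # obj_points: (B, N, 3) object point cloud; part_labels: (B, N) finger-part ids
        f_pts = self.point_mlp(obj_points)
        f_sem = self.part_embed(part_labels)
        # Per-point contact probability in [0, 1], shape (B, N)
        return self.head(torch.cat([f_pts, f_sem], dim=-1)).squeeze(-1)


class ContactConditionalModule(nn.Module):
    """One denoising step of a diffusion model over hand parameters,
    conditioned on the object point cloud and the contact map."""
    def __init__(self, hand_dim=61, feat_dim=256):  # hand_dim: assumed MANO-like parameters
        super().__init__()
        self.cond_enc = nn.Sequential(nn.Linear(3 + 1, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, feat_dim))
        self.denoiser = nn.Sequential(nn.Linear(hand_dim + feat_dim + 1, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, hand_dim))

    def forward(self, noisy_hand, t, obj_points, contact_map):
        # noisy_hand: (B, hand_dim); t: (B, 1) diffusion timestep
        cond = self.cond_enc(torch.cat([obj_points, contact_map.unsqueeze(-1)], dim=-1))
        cond = cond.max(dim=1).values  # simple permutation-invariant pooling over points
        # Predicted noise (or clean sample, depending on the parameterization)
        return self.denoiser(torch.cat([noisy_hand, cond, t], dim=-1))


if __name__ == "__main__":
    B, N = 2, 1024
    scm = SemanticConditionalModule()
    ccm = ContactConditionalModule()
    obj = torch.randn(B, N, 3)
    labels = torch.randint(0, 17, (B, N))          # user clicks mapped to part labels
    contact = scm(obj, labels)                      # stage 1: semantic contact map
    noise_pred = ccm(torch.randn(B, 61), torch.rand(B, 1), obj, contact)  # stage 2
    print(contact.shape, noise_pred.shape)
```

In this sketch the contact map produced by the first module is the sole interface to the second, which reflects the controllability claim: a user-specified map can be substituted for the predicted one without changing the grasp-generation stage.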
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Multimedia Foundation Models, [Experience] Interactions and Quality of Experience, [Experience] Multimedia Applications
Relevance To Conference: This work contributes to multimedia/multimodal processing by advancing the state of the art in hand grasp generation through the integration of diffusion models. By generating more natural and accurate hand grasps, this research enhances interactivity and realism in multimedia applications such as virtual reality (VR), augmented reality (AR), and robotics. Specifically, controllable grasp generation facilitates more nuanced and realistic human-computer and human-robot interactions, allowing for a more immersive user experience in VR and AR scenarios and more effective grasp execution in robotics. Moreover, the use of diffusion models for hand grasp generation opens up new possibilities for synthesizing multimodal data. Thus, this work not only pushes the boundaries of what is possible in hand grasp synthesis but also contributes to the broader field of multimedia/multimodal processing by enhancing the quality of interactive experiences.
Supplementary Material: zip
Submission Number: 1180