MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Multimodal Generation, Hand-Object Interaction
TL;DR: MEgoHand generates high-quality hand-object interaction motion sequences conditioned on egocentric RGB images, textual instructions, and initial MANO hand parameters.
Abstract: Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalization to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose **MEgoHand**, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level “cerebrum” leverages a vision language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of **3.35M** RGB-D frames, **24K** interactions, and **1.2K** objects. Extensive experiments across **five** in-domain and **two** cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (**86.9%**) and joint rotation error (**34.1%**), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.
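To make the low-level policy concrete, below is a minimal sketch (not the authors' code) of how a flow-matching policy can produce a motion trajectory by Euler-integrating a learned velocity field from Gaussian noise to a motion sample. The network `VelocityNet`, the conditioning vector `ctx`, and all dimensions are illustrative assumptions, not the paper's API; the actual method uses a DiT backbone with temporal orthogonal filtering.

```python
# Minimal flow-matching sampling sketch (assumptions: a simple MLP velocity
# field, a single fused conditioning vector, and 51-D MANO parameters per frame).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the velocity of the trajectory sample at flow time t."""
    def __init__(self, traj_dim: int, ctx_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + ctx_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, ctx, t], dim=-1))

@torch.no_grad()
def sample_trajectory(velocity_net: VelocityNet, ctx: torch.Tensor,
                      traj_dim: int, steps: int = 10) -> torch.Tensor:
    """Integrate dx/dt = v_theta(x, t, ctx) from t=0 (noise) to t=1 (motion)."""
    x = torch.randn(ctx.shape[0], traj_dim)      # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((ctx.shape[0], 1), i * dt)
        x = x + dt * velocity_net(x, t, ctx)     # forward Euler step
    return x

# Example: 16 future frames of 51-D MANO parameters (48 pose + 3 translation),
# conditioned on a 512-D context vector (dimensions are assumptions).
ctx = torch.randn(2, 512)
net = VelocityNet(traj_dim=16 * 51, ctx_dim=512)
motion = sample_trajectory(net, ctx, traj_dim=16 * 51)
print(motion.shape)  # torch.Size([2, 816])
```

In practice the context would fuse the VLM motion prior, the monocular depth features, and the initial MANO hand pose, and the velocity field would be a conditional DiT rather than an MLP.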
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 9051