Modal-Enhanced Semantic Modeling for Fine-Grained 3D Human Motion Retrieval

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Text-to-Motion Retrieval (TMR) is an emerging task that retrieves relevant motion sequences given a natural language description. The dominant existing approach learns a joint embedding space and measures global-level similarities. However, simple global embeddings are insufficient to represent complicated motion and textual details, such as the movement of specific body parts and the coordination among them. In addition, most motion variations occur subtly and locally, resulting in semantic vagueness among these motions, which further presents considerable challenges in precisely aligning motion sequences with texts. To address these challenges, we propose a novel Modal-Enhanced Semantic Modeling (MESM) method, which focuses on fine-grained alignment through enhanced modal semantics. Specifically, we develop a prompt-enhanced textual module (PTM) to generate detailed descriptions of specific body-part movements, comprehensively capturing fine-grained textual semantics for precise matching. We employ a skeleton-enhanced motion module (SMM) to strengthen the model's capability to represent intricate motions; this module leverages a graph convolutional network to model the spatial dependencies among relevant body parts. To improve sensitivity to subtle motions, we further propose a text-driven semantics interaction module (TSIM). The TSIM first assigns motion features to a set of aggregated descriptors, then employs cross-attention to aggregate discriminative motion embeddings guided by textual semantics, enabling precise semantic alignment between subtle motions and the corresponding texts. Extensive experiments on two widely used benchmark datasets, HumanML3D and KIT-ML, demonstrate the effectiveness of the proposed method. Our approach outperforms existing state-of-the-art retrieval methods, achieving significant Rsum improvements of 24.28% on HumanML3D and 25.80% on KIT-ML.
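To make the TSIM mechanism concrete, below is a minimal PyTorch sketch of a descriptor-assignment plus text-guided cross-attention step as described in the abstract. The class name `TSIMSketch`, the tensor shapes, the soft-assignment design, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TSIMSketch(nn.Module):
    """Hypothetical text-driven semantics interaction module:
    soft-assigns frame-level motion features to K learnable
    descriptors, then lets text tokens attend over those slots."""
    def __init__(self, dim=256, num_descriptors=16, num_heads=4):
        super().__init__()
        self.descriptors = nn.Parameter(torch.randn(num_descriptors, dim))
        self.assign = nn.Linear(dim, num_descriptors)  # soft-assignment logits per frame
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, motion, text):
        # motion: (B, T, D) frame features; text: (B, L, D) token features
        weights = self.assign(motion).softmax(dim=1)           # (B, T, K), normalized over time
        # aggregate frames into K descriptor slots: (B, K, D)
        slots = torch.einsum('btk,btd->bkd', weights, motion) + self.descriptors
        # text tokens query the motion descriptor slots
        attended, _ = self.cross_attn(query=text, key=slots, value=slots)
        return attended.mean(dim=1)                            # (B, D) text-guided motion embedding
```

In this sketch, pooling the attended tokens yields a single embedding that can be compared against a text embedding with a standard contrastive retrieval loss; the actual aggregation and loss in the paper may differ.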
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: In this work, we focus on the emerging task of text-to-3D human motion retrieval. We thoroughly examine the challenges specific to motion sequence retrieval and the limitations of existing methods. To address these issues, we propose a novel modal-enhanced fine-grained alignment framework that achieves precise matching between motion sequences and text descriptions. Specifically, we introduce feature enhancement modules that employ a large language model (LLM) and a graph convolutional network (GCN) to enhance the representations of the textual and motion modalities, respectively. Furthermore, we develop a cross-modal semantic interaction module that facilitates more precise semantic alignment under the guidance of textual information. Our method achieves significant improvements on the text-to-3D human motion retrieval task.
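As a rough illustration of the GCN-based motion enhancement mentioned above, here is a minimal PyTorch sketch of a single graph convolution over a fixed joint-adjacency matrix, applied per frame. The class name, the row normalization, and the tensor layout are assumptions for illustration only; the paper's actual skeleton-enhanced motion module is not specified at this level of detail here.

```python
import torch
import torch.nn as nn

class SkeletonGCNSketch(nn.Module):
    """Hypothetical skeleton graph-convolution layer: propagates
    per-joint features along a fixed skeleton adjacency to model
    spatial dependencies among body parts, independently per frame."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        # adjacency: (J, J) binary skeleton graph, assumed to include self-loops
        deg = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
        self.register_buffer('a_norm', adjacency / deg)  # row-normalized propagation matrix
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: (B, T, J, D) per-joint features over T frames
        x = torch.einsum('ij,btjd->btid', self.a_norm, x)  # mix each joint with its neighbors
        return self.proj(x).relu()
```

Stacking a few such layers and pooling over joints would yield frame-level motion features suitable for the cross-modal interaction step sketched earlier.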
Submission Number: 5129