Abstract: Text-motion retrieval (TMR) is an important cross-modal task that aims to retrieve motion sequences semantically similar to a given query text. Existing studies primarily represent and align the text and the motion sequence with single global embeddings. In the real world, however, a motion sequence usually consists of multiple atomic motions with complicated semantics, and such a simple approach can hardly capture the complex relations and rich semantics in the text and the motion sequence. In addition, many atomic motions co-occur and are coupled together, which further complicates modeling and aligning queries and motion sequences. In this paper, we regard TMR as a multi-instance multi-label learning (MIML) problem, where the motion sequence is viewed as a bag of atomic motions and the text as a bag of corresponding phrase descriptions. To address the MIML problem, we propose a novel multi-granularity semantics interaction (MGSI) approach that captures and aligns the semantics of text and motion sequences at multiple scales. Specifically, MGSI first decomposes the query and the motion sequence into three levels: events (bags), actions (instances), and entities. It then adopts graph neural networks (GNNs) to explicitly model their semantic correlations and performs semantic interaction at the corresponding scales to align text and motion. In addition, we introduce a co-occurring motion mining approach that uses the semantic consistency between atomic motions as a measure to identify co-occurring atomic motions. These co-occurring atomic motions are fused and interacted with the corresponding text to achieve precise cross-modal alignment. We evaluate our method on the HumanML3D and KIT-ML datasets, achieving improvements in Rsum of 23.09\% on HumanML3D and 21.84\% on KIT-ML.
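To make the co-occurring motion mining step concrete, below is a minimal PyTorch sketch, not the paper's implementation: it assumes cosine similarity between atomic-motion embeddings as a stand-in for the semantic-consistency measure and mean pooling as the fusion step; the function name, threshold, and embedding dimension are all hypothetical.

```python
import torch
import torch.nn.functional as F

def mine_cooccurring_motions(atomic_embs: torch.Tensor, threshold: float = 0.7):
    """Group atomic-motion embeddings whose pairwise cosine similarity
    (assumed here as the semantic-consistency measure) exceeds `threshold`,
    then fuse each group by mean pooling.

    atomic_embs: (N, D) embeddings of the N atomic motions in one sequence.
    Returns a list of fused (D,) embeddings, one per co-occurring group.
    """
    normed = F.normalize(atomic_embs, dim=-1)
    sim = normed @ normed.T                      # (N, N) cosine similarities

    n = atomic_embs.size(0)
    assigned = torch.zeros(n, dtype=torch.bool)
    fused = []
    for i in range(n):
        if assigned[i]:
            continue
        # members of the group seeded by motion i (including i itself)
        group = (sim[i] >= threshold) & ~assigned
        group[i] = True
        assigned |= group
        fused.append(atomic_embs[group].mean(dim=0))  # fuse co-occurring motions
    return fused

# toy usage with random embeddings (hypothetical dimensions)
embs = torch.randn(5, 256)
groups = mine_cooccurring_motions(embs)
print(len(groups), groups[0].shape)
```

The fused group embeddings would then be interacted with the corresponding text phrases for cross-modal alignment, as described in the abstract.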
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Relevance To Conference: Text-motion retrieval (TMR) is an important cross-modal task that aims to retrieve motion sequences semantically similar to a given query text. In this paper, we regard TMR as a multi-instance multi-label learning (MIML) problem, where the motion sequence is viewed as a bag of atomic motions and the text as a bag of corresponding phrase descriptions, and we propose a novel multi-granularity semantics interaction (MGSI) approach to capture and align the semantics of text and motion sequences at multiple scales. Experimental results demonstrate that our method outperforms state-of-the-art methods, with Rsum improvements of 26.09% on HumanML3D and 21.84% on KIT-ML.
Submission Number: 443