Abstract: Existing studies on text-to-motion generation (TMG) routinely focus on the objective alignment of text and motion, while largely ignoring subjective emotion information, especially limb-level emotion information. With this in mind, this paper proposes a new Emotion-enriched Text-to-Motion Generation (ETMG) task, which aims to generate motions that carry subjective emotion information. Further, this paper argues that injecting emotions into limbs (named intra-limb emotion injection) and ensuring the coordination and coherence of emotional motions after this injection (named inter-limb emotion disturbance) are both important and challenging in the ETMG task. To this end, this paper proposes an LLM-guided Limb-level Emotion Manipulating (${\rm L^{3}EM}$) approach to ETMG. Specifically, the approach designs an LLM-guided intra-limb emotion modeling block to inject emotion into limbs, followed by a graph-structured inter-limb relation modeling block to ensure the coordination and coherence of the resulting emotional motions. In particular, this paper constructs a coarse-grained Emotional Text-to-Motion (EmotionalT2M) dataset and a fine-grained Limb-level Emotional Text-to-Motion (Limb-ET2M) dataset to justify the effectiveness of the proposed ${\rm L^{3}EM}$ approach. Detailed evaluation demonstrates the significant advantage of our ${\rm L^{3}EM}$ approach over state-of-the-art baselines, justifying both the importance of limb-level emotion information for ETMG and the effectiveness of ${\rm L^{3}EM}$ in coherently manipulating such information.
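To make the two building blocks concrete, the following simplified PyTorch sketch illustrates one plausible realization of intra-limb emotion injection and graph-structured inter-limb relation modeling. It assumes five limb groups, cross-attention-based injection, and a single graph-convolution step; all module and variable names are hypothetical and do not reflect the exact architecture of ${\rm L^{3}EM}$.

```python
# Illustrative sketch only; limb grouping, dimensions, and modules are assumptions.
import torch
import torch.nn as nn

NUM_LIMBS, LIMB_DIM, EMO_DIM = 5, 64, 64  # torso, two arms, two legs (assumed)

class IntraLimbEmotionInjection(nn.Module):
    """Injects an LLM-derived emotion embedding into each limb's motion feature
    via cross-attention (one plausible realization of intra-limb injection)."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(LIMB_DIM, num_heads=4, batch_first=True)
        self.proj = nn.Linear(EMO_DIM, LIMB_DIM)

    def forward(self, limb_feats, emo_embeds):
        # limb_feats: (batch, NUM_LIMBS, LIMB_DIM); emo_embeds: (batch, NUM_LIMBS, EMO_DIM)
        emo = self.proj(emo_embeds)
        out, _ = self.attn(query=limb_feats, key=emo, value=emo)
        return limb_feats + out  # residual keeps the original motion content

class InterLimbRelationModeling(nn.Module):
    """Propagates features over a fixed limb adjacency graph so that emotion
    injected into one limb stays coordinated with the others."""
    def __init__(self):
        super().__init__()
        # Hypothetical adjacency: the torso (node 0) connects to all four limbs.
        adj = torch.eye(NUM_LIMBS)
        adj[0, 1:] = adj[1:, 0] = 1.0
        self.register_buffer("adj", adj / adj.sum(dim=-1, keepdim=True))
        self.gcn = nn.Linear(LIMB_DIM, LIMB_DIM)

    def forward(self, limb_feats):
        # One graph-convolution step: average over neighbors, then transform.
        return torch.relu(self.gcn(self.adj @ limb_feats))

if __name__ == "__main__":
    lem, grm = IntraLimbEmotionInjection(), InterLimbRelationModeling()
    motion = torch.randn(2, NUM_LIMBS, LIMB_DIM)
    emotion = torch.randn(2, NUM_LIMBS, EMO_DIM)
    print(grm(lem(motion, emotion)).shape)  # torch.Size([2, 5, 64])
```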
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This paper introduces a new multimodal task, Emotion-enriched Text-to-Motion Generation (ETMG), which represents the first attempt to incorporate diverse, subjective emotion information into motion generation. To tackle this task, the paper proposes an LLM-guided Limb-level Emotion Manipulating approach that contains an LLM-guided Intra-limb Emotion Modeling (LEM) block and a Graph-structured Inter-limb Relation Modeling (GRM) block. Specifically, the LEM block injects emotion information into individual limbs through large language models (LLMs), and the GRM block captures limb spatial position information via a limb relation graph. Furthermore, we construct a coarse-grained Emotional Text-to-Motion dataset and a fine-grained Limb-level Emotional Text-to-Motion dataset for the ETMG task. Extensive experimental results on these datasets demonstrate the superior performance of our approach over state-of-the-art baselines.
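To illustrate the "LLM-guided" aspect of the LEM block, the snippet below sketches one way a coarse emotional caption could be decomposed into limb-level cues. The prompt wording, limb set, and JSON schema are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical prompt construction for limb-level emotion decomposition.
import json

LIMBS = ["torso", "left arm", "right arm", "left leg", "right leg"]

def build_limb_emotion_prompt(caption: str, emotion: str) -> str:
    """Compose a prompt asking an LLM to map a sentence-level emotion onto
    limb-level cues that a motion generator can condition on."""
    return (
        f"Motion description: {caption!r} with overall emotion {emotion!r}.\n"
        f"For each of the limbs {LIMBS}, describe in one short phrase how this "
        "emotion should manifest in that limb's movement. "
        "Answer as a JSON object mapping limb name to phrase."
    )

print(build_limb_emotion_prompt("a person walks forward", "dejected"))

# A hand-written example of the kind of reply the downstream blocks could consume:
example_reply = json.dumps({
    "torso": "slumped and leaning forward",
    "left arm": "hanging limply",
    "right arm": "hanging limply",
    "left leg": "short dragging steps",
    "right leg": "short dragging steps",
})
```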
The contributions of our work to multimedia/multimodal processing are as follows:
1. We are the first to consider emotion information in the text-to-motion task, and propose a new Emotion-enriched Text-to-Motion Generation (ETMG) task.
2. Our approach is the first attempt to combine LLMs with diffusion models for the motion generation task (see the sketch after this list).
3. Our LEM block and GRM block provide new tools and methods for multimodal processing.
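As a hedged illustration of contribution 2, the sketch below shows how LLM-derived, emotion-conditioned features could drive a standard DDPM-style denoising loop for a motion latent. The denoiser architecture, noise schedule, and latent dimensionality are placeholder assumptions rather than the model used in the paper.

```python
# Illustrative emotion-conditioned diffusion sampling; all settings are assumed.
import torch
import torch.nn as nn

T = 50  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class EmotionConditionedDenoiser(nn.Module):
    """Predicts the noise in a motion latent, conditioned on an emotion feature."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, 128), nn.SiLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy latent, condition, and a normalized timestep scalar.
        t_feat = torch.full_like(x_t[:, :1], float(t) / T)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample(denoiser, cond, dim=64):
    x = torch.randn(cond.shape[0], dim)  # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)
        # Standard DDPM posterior mean update.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

motion_latent = sample(EmotionConditionedDenoiser(), cond=torch.randn(2, 64))
print(motion_latent.shape)  # torch.Size([2, 64])
```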
Supplementary Material: zip
Submission Number: 4260