SATO: Stable Text-to-Motion Framework

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Is the text-to-motion model robust? Recent advancements in text-to-motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with text-to-motion models: their predictions are often inconsistent, producing vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and maintaining a balance between accuracy and robustness. We present a methodology for constructing a SATO that satisfies both attention and prediction stability. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonym and other slight perturbations while maintaining its high accuracy.
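The abstract does not spell out how the three modules are realized. As a rough illustration only, the sketch below shows one way the ideas could map to a training objective in PyTorch, assuming a model that returns both the generated motion and the text encoder's attention map for a prompt; the function name, the model interface, and the loss weights are hypothetical, and the paper's actual formulation may differ.

    import torch
    import torch.nn.functional as F

    def stability_regularized_loss(model, text, perturbed_text, motion_gt,
                                   w_attn=1.0, w_pred=1.0):
        """Hypothetical sketch of a stability-regularized objective.

        `model` is assumed to return (motion_pred, text_attention) for a
        batch of text prompts, where the attention map comes from the text
        encoder. The weights w_attn / w_pred trade accuracy for robustness.
        """
        # Task term: standard text-to-motion reconstruction loss on the
        # original prompt, preserving accuracy.
        motion_pred, attn = model(text)
        task_loss = F.mse_loss(motion_pred, motion_gt)

        # Stable attention: keep the text encoder's attention pattern
        # consistent under a synonym-perturbed version of the prompt.
        motion_pred_p, attn_p = model(perturbed_text)
        attn_loss = F.kl_div(attn_p.log_softmax(-1), attn.softmax(-1),
                             reduction="batchmean")

        # Stable prediction: the generated motions themselves should also
        # agree across the original and perturbed prompts.
        pred_loss = F.mse_loss(motion_pred_p, motion_pred.detach())

        # Balance accuracy against robustness via the two weights.
        return task_loss + w_attn * attn_loss + w_pred * pred_loss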
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This study addresses the issue of stability in text-to-motion tasks. We have discovered that, owing to the diversity of inputs, models in this field yield inconsistent outputs, and sometimes even catastrophic errors, even for text inputs with similar or identical semantics. This clearly constrains our ability to generate motion from text. We propose a novel Stable Text-to-Motion Framework and analyze the process of arriving at this stable framework. Our results demonstrate that the new framework significantly improves the stability of generating motion from text. Our work pioneers a novel direction for improving text-to-motion models, paving the way for more robust models for applications in virtual environments.
Supplementary Material: zip
Submission Number: 2283