Multi-Scale Memory Fusion with Dynamic Decay for Coherent Text-to-Motion Generation

ICLR 2026 Conference Submission 19593 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Motion Generation, Exponential Decay
TL;DR: This paper introduces a novel temporal modeling approach that significantly improves the coherence and quality of text-to-3D human motion generation by leveraging hierarchical context and adaptive temporal attention.
Abstract: Text-to-3D human motion generation has emerged as a critical challenge in human-AI interaction, with transformative applications spanning virtual reality, robotic control, and digital content creation. While recent advances in diffusion models and transformer architectures have significantly improved motion quality, we identify two fundamental limitations that persist in state-of-the-art methods: (1) suboptimal utilization of multi-scale historical context, leading to motion discontinuity, and (2) uniform temporal weighting that fails to capture phase-dependent feature importance in complex motion sequences. To address these challenges, we propose FADM (Feedback-Augmented Decay Motion Model), a novel framework that introduces three key innovations: a hierarchical memory fusion module with learnable scale adapters that preserves both local kinematics and global action semantics, an exponentially decaying temporal attention mechanism grounded in human motion dynamics, and a semantic-consistent autoregressive feedback loop ensuring long-range coherence. Extensive experiments demonstrate our method's state-of-the-art performance, achieving a 22.2% FID reduction on HumanML3D, a 64.7% improvement in Top-1 accuracy, and 30.9% better generalization on KIT-ML, while maintaining competitive motion diversity (Multimodality score: 1.283±0.044). Beyond its immediate applications, FADM establishes a new paradigm for temporal modeling that can potentially benefit various conditional generation tasks, including video synthesis and robotic motion planning.
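The abstract does not spell out how the exponentially decaying temporal attention is computed; as a minimal sketch, one common realization is an additive penalty on the attention logits proportional to temporal distance, which is equivalent to multiplying the attention weights by an exponential decay in frame distance. The function name and the fixed `decay_rate` hyperparameter below are illustrative assumptions, not the paper's (possibly learnable or phase-adaptive) formulation:

```python
import numpy as np

def decayed_attention(q, k, v, decay_rate=0.1):
    """Causal scaled dot-product attention with an exponential decay bias.

    A linear penalty `decay_rate * |t - s|` on the logits becomes an
    exponential decay factor exp(-decay_rate * |t - s|) on the softmax
    weights, so recent frames dominate while distant context remains
    accessible. `decay_rate` is a hypothetical fixed hyperparameter.
    q, k, v: arrays of shape (T, d) for a T-frame motion sequence.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) attention logits
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    scores = scores - decay_rate * dist                # exponential decay on weights
    # causal mask: frame t may only attend to frames s <= t
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (T, d) attended features
```

As a sanity check on the design, driving `decay_rate` to a very large value collapses each frame's attention onto itself (distance 0 is the only unpenalized causal position), so the output approaches `v` unchanged.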
Primary Area: generative models
Submission Number: 19593