FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
NeurIPS 2025 poster · CC BY-NC-SA 4.0
Keywords: Motion Generation, Diffusion Model
TL;DR: FlashMo introduces a geometric factorized interpolant and frequency-sparse attention, enabling scalable efficient 3D motion diffusion.
Abstract: Diffusion models have recently advanced 3D human motion generation by producing smoother and more realistic sequences from natural language. However, existing approaches face two major challenges: high computational cost during training and inference, and limited scalability due to reliance on U-Net inductive bias. To address these challenges, we propose **FlashMo**, a frequency-aware sparse motion diffusion model that prunes low-frequency tokens to enhance efficiency without custom kernel design. We further introduce *MotionSiT*, a scalable diffusion transformer based on a joint-temporal factorized interpolant with Lie group geodesics over the $\mathrm{SO}(3)$ manifold, enabling principled generation of joint rotations. Extensive experiments on the large-scale MotionHub V2 dataset and standard benchmarks including HumanML3D and KIT-ML demonstrate that our method significantly outperforms previous approaches in motion quality, efficiency, and scalability. Compared to the state-of-the-art 1-step distillation baseline, FlashMo reduces inference time by **12.9%** and FID by **34.1%**. Project website: https://steve-zeyu-zhang.github.io/FlashMo.
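The abstract's joint-temporal factorized interpolant relies on Lie group geodesics over $\mathrm{SO}(3)$ for joint rotations. As a rough conceptual illustration of that idea only (not the paper's implementation), the sketch below interpolates between two rotations along the $\mathrm{SO}(3)$ geodesic via the log/exp maps; the function name and the use of SciPy are assumptions made for this example.

```python
from scipy.spatial.transform import Rotation as R

def so3_geodesic_interpolant(R0, R1, t):
    """Interpolate between rotations R0 and R1 along the SO(3) geodesic.

    Hypothetical helper, not from the paper: implements
    R(t) = R0 * exp(t * log(R0^{-1} R1)) for t in [0, 1],
    i.e. constant-angular-velocity interpolation on the rotation manifold.
    """
    rel = R0.inv() * R1                   # relative rotation R0^{-1} R1
    omega = rel.as_rotvec()               # log map: axis-angle vector in the Lie algebra so(3)
    return R0 * R.from_rotvec(t * omega)  # exp map: scaled geodesic step back onto SO(3)

# Halfway between identity and a 90-degree rotation about z gives a 45-degree rotation about z.
R0, R1 = R.identity(), R.from_euler("z", 90, degrees=True)
print(so3_geodesic_interpolant(R0, R1, 0.5).as_euler("xyz", degrees=True))  # ~[0, 0, 45]
```

In a motion-diffusion setting, a geodesic step of this kind would presumably be applied per joint and per frame inside the factorized interpolant; the paper's actual formulation may differ.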
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 6198