FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
NeurIPS 2025 poster · CC BY-NC-SA 4.0
Keywords: Motion Generation, Diffusion Model
TL;DR: FlashMo introduces a geometric factorized interpolant and frequency-sparse attention, enabling scalable efficient 3D motion diffusion.
Abstract: Diffusion models have recently advanced 3D human motion generation by producing smoother and more realistic sequences from natural language. However, existing approaches face two major challenges: high computational cost during training and inference, and limited scalability due to reliance on U-Net inductive bias. To address these challenges, we propose **FlashMo**, a frequency-aware sparse motion diffusion model that prunes low-frequency tokens to enhance efficiency without custom kernel design. We further introduce *MotionSiT*, a scalable diffusion transformer based on a joint-temporal factorized interpolant with Lie group geodesics over the $\mathrm{SO}(3)$ manifold, enabling principled generation of joint rotations. Extensive experiments on the large-scale MotionHub V2 dataset and standard benchmarks including HumanML3D and KIT-ML demonstrate that our method significantly outperforms previous approaches in motion quality, efficiency, and scalability. Compared to the state-of-the-art 1-step distillation baseline, FlashMo reduces inference time by **12.9%** and FID by **34.1%**. Project website: https://steve-zeyu-zhang.github.io/FlashMo.
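The abstract's joint-temporal factorized interpolant relies on Lie group geodesics over $\mathrm{SO}(3)$ for joint rotations. As a rough conceptual illustration of that idea only (not the paper's implementation), the sketch below interpolates between two rotations along the $\mathrm{SO}(3)$ geodesic via the log/exp maps; the function name and the use of SciPy are assumptions made for this example.

```python
from scipy.spatial.transform import Rotation as R

def so3_geodesic_interpolant(R0, R1, t):
    """Interpolate between rotations R0 and R1 along the SO(3) geodesic.

    Hypothetical helper, not from the paper: implements
    R(t) = R0 * exp(t * log(R0^{-1} R1)) for t in [0, 1],
    i.e. constant-angular-velocity interpolation on the rotation manifold.
    """
    rel = R0.inv() * R1                   # relative rotation R0^{-1} R1
    omega = rel.as_rotvec()               # log map: axis-angle vector in the Lie algebra so(3)
    return R0 * R.from_rotvec(t * omega)  # exp map: scaled geodesic step back onto SO(3)

# Halfway between identity and a 90-degree rotation about z gives a 45-degree rotation about z.
R0, R1 = R.identity(), R.from_euler("z", 90, degrees=True)
print(so3_geodesic_interpolant(R0, R1, 0.5).as_euler("xyz", degrees=True))  # ~[0, 0, 45]
```

In a motion-diffusion setting, a geodesic step of this kind would presumably be applied per joint and per frame inside the factorized interpolant; the paper's actual formulation may differ.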
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 6198