MOGO: Residual Quantized Hierarchical Causal Transformer for Real-Time and Infinite-Length 3D Human Motion Generation

Published: 20 Jan 2026 · Last Modified: 09 Mar 2026 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Recent advances in transformer-based text-to-motion generation have significantly improved motion quality. However, achieving both real-time performance and long-horizon scalability remains an open challenge. In this paper, we present MOGO (Motion Generation with One-pass), a novel autoregressive framework for efficient and scalable 3D human motion generation. MOGO consists of two key components. First, we introduce MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences through learnable scaling parameters, which dynamically regulate the information flow at each level to produce compact yet expressive multi-level representations. Second, to fully exploit these high-quality motion representations, we design the RQHC-Transformer, a residual quantized hierarchical causal transformer that structurally aligns with the multi-level latent hierarchy produced by MoSA-VQ. Each level is decoded by a dedicated transformer block, enabling efficient multi-scale generation in a single forward pass. Compared to diffusion-based and LLM-based approaches, MOGO achieves lower inference latency while maintaining high motion quality. Notably, the hierarchical latent modeling that emerges from the synergy of MoSA-VQ and the RQHC-Transformer equips MOGO with seamless and coherent infinite-length generation. By iteratively extending motion from any given frame and allowing control signals to be updated at arbitrary points, the model produces stable transitions and responds adaptively to new conditions, enabling real-time, controllable long-horizon synthesis with strong temporal consistency. Extensive experiments on HumanML3D and KIT-ML validate the quality and efficiency of our approach, while evaluation on the unseen CMP dataset demonstrates strong zero-shot generalization.
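To make the hierarchical residual quantization idea concrete, the following is a minimal NumPy sketch of multi-level residual vector quantization with a per-level scaling factor standing in for MoSA-VQ's learnable scaling parameters. All names (`residual_quantize`, `codebooks`, `scales`) are illustrative assumptions, not the paper's actual API; the real module learns the codebooks and scales end-to-end rather than taking them as fixed inputs.

```python
import numpy as np

def residual_quantize(x, codebooks, scales):
    """Illustrative multi-level residual vector quantization.

    x:         (T, D) array of per-frame motion features.
    codebooks: list of (K, D) arrays, one codebook per level.
    scales:    per-level scalars (stand-ins for MoSA-VQ's learnable
               scaling parameters that regulate information flow).

    Returns (L, T) code indices and the (T, D) reconstruction.
    """
    residual = x.astype(float)
    recon = np.zeros_like(residual)
    indices = []
    for codebook, s in zip(codebooks, scales):
        scaled = residual * s                                  # scale residual for this level
        dists = ((scaled[:, None, :] - codebook[None]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)                             # nearest code per frame
        q = codebook[idx] / s                                  # de-scale the selected codes
        recon += q                                             # accumulate reconstruction
        residual = residual - q                                # pass residual to next level
        indices.append(idx)
    return np.stack(indices), recon
```

Each level quantizes only what the previous levels failed to capture, which is why a causal transformer with one dedicated block per level (as in the RQHC-Transformer) can decode the whole hierarchy in a single forward pass.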