Abstract: Generating realistic, holistic motion from speech is crucial for creating believable talking avatars. While recent studies on masked motion models show promise, they struggle to accurately identify semantically significant frames and to align heterogeneous speech and motion modalities effectively. In this work, we propose a speech-queried masked motion modeling framework, named EchoTalk, that identifies semantically coherent and expressive co-speech motions using speech as active queries. Our key insight is to leverage learnable, motion-audio-aligned speech queries to guide the masked motion modeling process, selectively masking semantically significant motion frames and thereby enabling the model to learn more effective and semantically coherent representations. Specifically, a speech-queried attention mechanism computes frame-level attention scores through interactions between motion keys and speech queries, steering selective masking toward frames with high semantic value. Additionally, we design MoauAlign, a hierarchical contrastive embedding module that projects paired speech and motion inputs into a unified latent space using low-level and high-level HuBERT features via shared transformer networks. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, producing high-quality co-speech motion.
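To make the masking step concrete, the following is a minimal PyTorch-style sketch of how speech-queried attention could score motion frames and select the most semantically significant ones for masking. The module name `SpeechQueriedMasking`, the projection dimensions, the scaled dot-product formulation, and the top-k mask ratio are illustrative assumptions; the abstract only specifies that frame-level scores arise from interactions between motion keys and learnable speech queries.

```python
import torch
import torch.nn as nn

class SpeechQueriedMasking(nn.Module):
    """Scores motion frames with speech queries and masks the top-scoring ones.

    Hypothetical sketch: projection sizes, scaled dot-product attention, and
    top-k selection are assumptions made here for illustration only.
    """

    def __init__(self, motion_dim: int, speech_dim: int, d_model: int = 256,
                 num_queries: int = 8, mask_ratio: float = 0.4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_queries, d_model))  # learnable speech queries
        self.speech_proj = nn.Linear(speech_dim, d_model)  # condition queries on speech features
        self.key_proj = nn.Linear(motion_dim, d_model)     # motion frames -> keys
        self.mask_ratio = mask_ratio
        self.scale = d_model ** -0.5

    def forward(self, motion: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        """motion: (B, T, motion_dim); speech: (B, T, speech_dim) -> boolean mask (B, T)."""
        B, T, _ = motion.shape
        # Queries conditioned on pooled speech features, one set per sequence.
        q = self.query.unsqueeze(0) + self.speech_proj(speech).mean(dim=1, keepdim=True)  # (B, Q, d)
        k = self.key_proj(motion)                                                         # (B, T, d)
        # Frame-level score: aggregate attention from all speech queries to each frame.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, Q, T)
        frame_scores = attn.sum(dim=1)                                    # (B, T)
        # Mask the highest-scoring (most semantically significant) frames.
        num_mask = max(1, int(self.mask_ratio * T))
        top_idx = frame_scores.topk(num_mask, dim=-1).indices
        mask = torch.zeros(B, T, dtype=torch.bool, device=motion.device)
        mask.scatter_(1, top_idx, True)
        return mask
```

In such a setup, the returned boolean mask would mark the frames to be hidden from the masked motion model, which then reconstructs them conditioned on the speech input.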