Abstract: In co-speech gesture generation, quantization plays a crucial role in transforming continuous gesture data into discrete tokens or codes, facilitating the alignment of gestures with speech and the preservation of their semantics. This study proposes a semantic-aware Vector Quantized Variational Auto-Encoder (VQVAE) model designed to address the limitations of existing approaches, which often fail to maintain the contextual and semantic coherence of gestures. Our method incorporates context-aware padding and a contrastive loss to improve the accuracy and consistency of gesture representations, thereby preserving their semantic integrity. Experimental results demonstrate that our semantic-aware VQVAE outperforms baseline methods across several metrics. These results underscore the potential of our approach to advance high-quality gesture quantization, facilitating more natural and coherent co-speech gesture generation.
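To make the two ingredients named in the abstract concrete, the sketch below shows a standard VQVAE quantization step (nearest-neighbour codebook lookup with a straight-through estimator) together with an InfoNCE-style contrastive loss between paired gesture and speech embeddings. This is a minimal illustration assuming PyTorch; the class and function names, dimensions, and hyperparameters are illustrative placeholders and are not taken from the paper, and the context-aware padding and full model architecture are not shown here.

```python
# Illustrative sketch only: names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""
    def __init__(self, num_codes=512, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):                       # z_e: (batch, time, code_dim)
        flat = z_e.reshape(-1, z_e.size(-1))      # (batch*time, code_dim)
        # Squared Euclidean distance from each encoder output to every codebook entry.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        codes = dist.argmin(dim=1)                # discrete gesture tokens
        z_q = self.codebook(codes).view_as(z_e)
        # Standard VQ-VAE codebook and commitment losses.
        vq_loss = (F.mse_loss(z_q, z_e.detach())
                   + self.beta * F.mse_loss(z_e, z_q.detach()))
        z_q = z_e + (z_q - z_e).detach()          # straight-through gradient
        return z_q, codes.view(z_e.shape[:-1]), vq_loss

def contrastive_loss(gesture_emb, speech_emb, temperature=0.07):
    """InfoNCE loss that pulls paired gesture/speech clips together
    and pushes apart mismatched pairs within the batch."""
    g = F.normalize(gesture_emb, dim=-1)          # (batch, dim)
    s = F.normalize(speech_emb, dim=-1)
    logits = g @ s.t() / temperature              # pairwise similarity matrix
    targets = torch.arange(g.size(0), device=g.device)
    return F.cross_entropy(logits, targets)
```

In this kind of setup, the VQ loss trains the discrete gesture codebook while the contrastive term encourages tokens to stay aligned with the accompanying speech, which is one plausible way to realise the semantic-awareness described above.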