Abstract: In the era of big models for art creation, 3D human motion generation has become a crucial research direction, with Vector Quantized Variational Auto-Encoders (VQ-VAEs) playing a pivotal role in bridging modalities for cross-modal tasks. This paper introduces an Anatomically-Informed VQ-VAE designed to leverage the inherent structure of the human body, a key yet previously underutilized bridge in this domain. The proposed method improves performance by partitioning motion data into anatomically meaningful subgroups, enabling the learning of expressive, semantically meaningful latent representations. The significance of this approach is twofold: it not only achieves state-of-the-art performance on the KIT dataset, but also underscores the necessity of integrating isomorphic components, i.e., those with shared structures across different modalities, into the design of cross-modal tasks. This emphasis on isomorphism paves the way for a deeper understanding of how to map effectively between modalities in AI-driven art generation, opening new avenues for future research.
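To make the partitioning idea concrete, the following is a minimal PyTorch sketch, not the authors' code, of per-part vector quantization: a pose vector is split into anatomically meaningful joint subgroups, and each subgroup is matched against its own codebook. The joint indices, group names, codebook size, and feature dimensions are illustrative assumptions rather than the paper's actual configuration.

```python
import torch

# Hypothetical partition of 20 joints into five body parts
# (illustrative indices, not the paper's actual skeleton layout).
BODY_PARTS = {
    "torso":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}
FEATS_PER_JOINT = 3   # e.g. 3D joint positions (assumed)
CODEBOOK_SIZE = 512   # assumed codebook size

# One codebook per anatomical subgroup; each code's dimension
# matches the flattened feature size of its body part.
codebooks = {
    part: torch.randn(CODEBOOK_SIZE, len(joints) * FEATS_PER_JOINT)
    for part, joints in BODY_PARTS.items()
}

def quantize_pose(pose: torch.Tensor) -> dict:
    """Nearest-neighbor quantization of each anatomical subgroup.

    pose: (num_joints, FEATS_PER_JOINT) tensor for a single frame.
    Returns the selected codebook index per body part.
    """
    indices = {}
    for part, joints in BODY_PARTS.items():
        z = pose[joints].reshape(-1)                   # flatten the subgroup
        dists = torch.cdist(z[None], codebooks[part])  # (1, CODEBOOK_SIZE)
        indices[part] = dists.argmin(dim=-1)           # nearest code
    return indices

frame = torch.randn(20, FEATS_PER_JOINT)
print(quantize_pose(frame))
```

In a full model, each subgroup's quantized code would feed a shared decoder, so the latent space factorizes along the body's anatomy rather than treating the pose as one undifferentiated vector.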