HumanExpert: Unified Multimodal Humanoid Generation

ICLR 2026 Conference Submission 15723 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Generation, Portrait Animation, Pose Animation, Large Language Model
TL;DR: HumanExpert is a unified multimodal generative framework integrating language, speech, and behavior for versatile humanoid agents.
Abstract: Despite recent advances in unified multimodal understanding and generation, building multimodal humanoid agents capable of mimicking core human abilities, such as language understanding, speech, and behavior generation, remains challenging. Symbolic modalities like language rely on discrete tokens, while perceptual modalities such as vision and behavior benefit from continuous representations, making unified understanding and generation across such diverse modalities difficult. We observe that by decoupling model parameters across modalities and adopting a modality-expert training strategy, we can avoid degrading the original language model’s intelligence while enabling the interleaving of continuous and discrete tokens within a unified generative framework. Building on this insight, we propose HumanExpert, a unified multimodal generative model for humanoid agent tasks that synthesizes language, speech, and behavior in one interleaved autoregressive-diffusion framework with a behavior expert. Specifically, HumanExpert employs a mixture-of-experts (MoE) architecture with a modality-independent backbone, where the behavior expert enables human behavior modeling while preserving the intelligence of the pre-trained language model. On top of this MoE architecture, we design an interleaved autoregressive-diffusion framework that generates text, audio, and behavior tokens, supervising the text and audio modalities with an autoregressive objective and the behavior modality with a diffusion loss. We further implement a diffusion forcing strategy to stabilize continuous generation. For this newly emerging and comprehensive task, we carefully design a humanoid agent evaluation protocol and achieve competitive performance in language understanding, audio-behavior alignment, and behavior execution for versatile multimodal humanoid generation.
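The abstract describes a hybrid training objective: discrete text and audio tokens are supervised autoregressively, while continuous behavior tokens receive a diffusion loss. The paper's actual formulation is not given here, so the following is only a minimal NumPy sketch of that idea, assuming a standard softmax cross-entropy for the autoregressive term, a DDPM-style epsilon-prediction MSE for the diffusion term, and a hypothetical weighting factor `lam` combining them.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Autoregressive loss for discrete modalities (text/audio tokens):
    # numerically stable softmax cross-entropy, averaged over positions.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def diffusion_loss(pred_noise, true_noise):
    # Diffusion loss for the continuous behavior modality:
    # DDPM-style epsilon-prediction mean-squared error (an assumption;
    # the paper may use a different diffusion parameterization).
    return ((pred_noise - true_noise) ** 2).mean()

def interleaved_loss(logits, targets, pred_noise, true_noise, lam=1.0):
    # Total objective: AR supervision on discrete tokens plus a weighted
    # diffusion term on behavior tokens. `lam` is hypothetical.
    return cross_entropy(logits, targets) + lam * diffusion_loss(pred_noise, true_noise)
```

This separation of objectives mirrors the abstract's point that the behavior expert can be trained with a continuous loss without touching the language model's discrete-token supervision.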
Supplementary Material: zip
Primary Area: generative models
Submission Number: 15723