Abstract: Large Language Models (LLMs) have shown remarkable emergent abilities in unifying almost all, if not all, NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. Inspired by InstructGPT [16] and the generalist concept behind Gato [27], we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, and generation, as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected, with language as the universal interface, constituting a closed loop within the framework. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vocabulary of the LLM. Then, an unsupervised pipeline is developed to generate natural-language descriptions of human action sequences from in-the-wild videos. Finally, all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks and promising results on high-level tasks, demonstrating the effectiveness of our proposed All-in-One framework. Moreover, for the first time, AvatarGPT enables a principled approach to unlimited long-motion synthesis by iterative traversal of the tasks within the closed loop.