Abstract: We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose estimation and generation tasks while offering user interactions. Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis. Code and data are available for research at https://yfeng95.github.io/ChatPose.