CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

Published: 07 May 2025, Last Modified: 29 May 2025 · VisCon 2025 Poster · CC BY 4.0
Keywords: Category-Agnostic Pose Estimation
Abstract: Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints. This process is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their stability and generalization capabilities. However, existing approaches still fall short in versatility and expressivity: they depend on additional support queries, make suboptimal use of language priors, and rely on simplistic parametric distributions. To address these limitations, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method is entirely support-free, requiring only a detailed text description of the keypoint together with the query image. To adapt the MLLM seamlessly to CAPE, we propose effective training strategies and carefully designed instructions, along with inference mechanisms that enhance visual reasoning for unseen keypoints. Furthermore, by design, CapeLLM can model the underlying spatial distribution and uncertainty, allowing adaptive refinement based on contextual cues. Beyond these advantages, CapeLLM sets a new state of the art on the MP-100 benchmark, surpassing the 5-shot performance of prior methods even in the 1-shot setting.
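For concreteness, below is a minimal sketch of what the support-free query interface described in the abstract could look like: a text description of a keypoint is turned into an instruction, sent to an MLLM together with the query image, and the answer is parsed into coordinates. The prompt wording, the `mllm_generate` backend, and the normalized `(x, y)` output format are all illustrative assumptions, not the paper's actual instruction template or implementation.

```python
import re
from typing import Tuple


def build_instruction(keypoint_desc: str) -> str:
    """Compose a text-only query describing the keypoint to locate.

    The wording here is a hypothetical stand-in for the paper's
    carefully designed instructions.
    """
    return (
        "You are given an image of an object. "
        f"Locate the keypoint described as: '{keypoint_desc}'. "
        "Answer with normalized coordinates in the form (x, y), "
        "where both values lie in [0, 1]."
    )


def mllm_generate(image_path: str, prompt: str) -> str:
    """Hypothetical MLLM backend; replace with a real vision-language model.

    Returns a fixed placeholder answer so the sketch runs end to end.
    """
    return "(0.42, 0.57)"


def parse_coordinates(response: str) -> Tuple[float, float]:
    """Extract an (x, y) pair from the model's free-form text answer."""
    match = re.search(
        r"\(?\s*([01](?:\.\d+)?)\s*,\s*([01](?:\.\d+)?)\s*\)?", response
    )
    if match is None:
        raise ValueError(f"No coordinates found in: {response!r}")
    return float(match.group(1)), float(match.group(2))


def estimate_keypoint(image_path: str, keypoint_desc: str) -> Tuple[float, float]:
    """Support-free keypoint query: image + keypoint description -> (x, y)."""
    prompt = build_instruction(keypoint_desc)
    response = mllm_generate(image_path, prompt)  # hypothetical MLLM call
    return parse_coordinates(response)


if __name__ == "__main__":
    # Example usage with an illustrative keypoint description.
    print(estimate_keypoint("query.jpg", "the left eye of the animal"))
```

Note that nothing category-specific appears in this interface: no support image or annotated exemplar is needed, which is what makes the approach support-free and category-agnostic.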
Submission Number: 27