Abstract: The capacity of existing human keypoint localization models is limited by the keypoint priors provided by the training data. To alleviate this restriction and pursue a more general model, this work studies keypoint localization from a different perspective: reasoning about locations based on keypoint clues in text descriptions. We propose LocLLM, the first Large Language Model (LLM) based keypoint localization model that takes images and text instructions as inputs and outputs the desired keypoint coordinates. LocLLM leverages the strong reasoning capability of the LLM and the clues about keypoint type, location, and relationships contained in textual descriptions. To effectively tune LocLLM, we construct localization-based instruction conversations that connect keypoint descriptions with the corresponding coordinates in the input image, and fine-tune the whole model in a parameter-efficient training pipeline. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into localization gives LocLLM superior flexibility and generalization in cross-dataset keypoint localization, and even enables it to detect novel types of keypoints unseen during training. Project page: https://github.com/kennethwdk/LocLLM.
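To make the "localization-based instruction conversation" idea concrete, the sketch below shows one way such a training record could be structured: a question built from a textual keypoint description paired with an answer encoding the coordinates. The prompt wording, record fields, and normalized [0, 1] coordinate format here are illustrative assumptions, not the exact templates used by LocLLM (see the project page for the actual implementation).

```python
# Minimal sketch of a localization-based instruction conversation.
# Assumptions (not taken from the paper): answers use (x, y) coordinates
# normalized to [0, 1], and records follow a generic multimodal chat format.

def build_conversation(image_path: str, keypoint_name: str,
                       description: str, x: float, y: float) -> dict:
    """Pair a textual keypoint description with its coordinates in the image."""
    question = (
        f"Where is the {keypoint_name} in this image? "
        f"{description} Answer with normalized coordinates."
    )
    answer = f"[{x:.3f}, {y:.3f}]"
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "assistant", "value": answer},
        ],
    }

if __name__ == "__main__":
    # Hypothetical example record for a single keypoint.
    record = build_conversation(
        "000001.jpg",
        "left knee",
        "The left knee is the joint connecting the left thigh and the left lower leg.",
        0.412, 0.735,
    )
    print(record)
```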