LLMs are Good Action Recognizers

Haoxuan Qu, Yujun Cai, Jun Liu

Published: 01 Jan 2024, Last Modified: 19 May 2025, Venue: CVPR 2024, License: CC BY-SA 4.0

Abstract: Skeleton-based action recognition has attracted considerable research attention. Recently, a variety of works have been proposed to build accurate skeleton-based action recognizers. Among them, some use large model architectures as the backbones of their recognizers to boost skeleton data representation capability, while others pre-train their recognizers on external data to enrich their knowledge. In this work, we observe that large language models, which have been extensively used in various natural language processing tasks, generally have both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process to project each input action signal (i.e., each skeleton sequence) into its “sentence format” (i.e., an “action sentence”). Moreover, we incorporate several designs into our framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.