Readers: Everyone (since 04 Oct 2024) · License: CC BY 4.0
Large Language Models (LLMs) have demonstrated strong reasoning capabilities and possess extensive common knowledge. This enables them to adapt to a variety of complex tasks in a zero-shot manner, including acting as controllers that drive automated systems by producing executable action sequences. However, a significant challenge in existing frameworks is the misalignment between a general pre-trained LLM and the action space of a specific control task. This misalignment demands extensive effort in designing task-specific prompts, which generalize poorly and do not guarantee consistent output when a pre-trained LLM is prompted to generate the desired action sequences. To address this issue, we propose ActionVerse, which encodes action candidates into a series of modality tokens and uses an efficient alignment technique to synchronize these action tokens with the LLM's language space. With this approach, ActionVerse turns a chat-based multi-modal LLM into a general action executor capable of handling tasks that require step-by-step execution of various actions. Experiments on several sequential action tasks demonstrate the effectiveness of the proposed framework.
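
The idea of encoding action candidates as tokens aligned with the language space can be illustrated with a toy sketch. This is not the ActionVerse implementation: the abstract does not specify the encoder or the alignment technique, so the linear projection, the dot-product scoring, and every name below (`make_action_vocab`, `project`, `pick_action`, the `<act:...>` token format) are assumptions made purely for illustration.

```python
# Toy sketch (hypothetical, not the paper's method): give each action
# candidate a dedicated token, map action features into the model's
# embedding space with a linear projection, and select an action by
# comparing a hidden state against the projected action embeddings.

def make_action_vocab(text_vocab, actions):
    """Append one dedicated token per action candidate to a copy of the vocabulary."""
    vocab = dict(text_vocab)
    for name in actions:
        vocab[f"<act:{name}>"] = len(vocab)
    return vocab

def project(features, weight):
    """Linear map from action-feature space to the embedding space.

    `weight` is a list of rows, one row per embedding dimension.
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weight]

def pick_action(hidden, action_embeddings):
    """Return the action whose embedding has the largest dot product with `hidden`."""
    def score(name):
        return sum(h * e for h, e in zip(hidden, action_embeddings[name]))
    return max(action_embeddings, key=score)

# Assumed toy setup: three actions with one-hot features, embedding size 4.
text_vocab = {"<pad>": 0, "go": 1, "to": 2}
actions = ["move_forward", "turn_left", "pick_up"]
vocab = make_action_vocab(text_vocab, actions)

features = {
    "move_forward": [1.0, 0.0, 0.0],
    "turn_left": [0.0, 1.0, 0.0],
    "pick_up": [0.0, 0.0, 1.0],
}
# Fixed projection weights stand in for a learned alignment layer.
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
action_embeddings = {name: project(f, weight) for name, f in features.items()}

# A made-up hidden state that lies closest to the "turn_left" embedding.
hidden = [0.1, 0.9, 0.0, 0.0]
chosen = pick_action(hidden, action_embeddings)
print(vocab[f"<act:{chosen}>"], chosen)
```

Because the output is restricted to the set of action tokens, a selection step of this kind always yields a valid action, which is one plausible reading of how token-level alignment could avoid the inconsistent outputs of free-form prompting. In a real system the projection would be learned rather than fixed.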