Keywords: LLM, embodied agent, prompt engineering
Abstract: Large Language Models (LLMs) show strong natural language understanding but often fail in embodied AI settings that demand physical validity and causal reasoning. We evaluate open-source models across four tasks: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling. Our analysis shows that LLMs often generate full action sequences without considering intermediate environmental feedback, leading to runtime failures. To address this issue, we encode physical constraints as declarative rules in system prompts and apply Supervised Fine-Tuning (SFT) to align the model with domain dynamics. These interventions improve physical validity, but their effectiveness varies by task. This study clarifies how prompt engineering and SFT affect embodied performance, revealing both the capabilities and the persistent limitations of current open-source models.
Submission Number: 11