Keywords: LLM, embodied agent, prompt engineering
Abstract: Large Language Models (LLMs) show strong natural language understanding but often fail in embodied AI settings that demand physical validity and causal reasoning. We evaluate open-source models across four tasks: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling. Our analysis shows that LLMs often generate full action sequences without considering intermediate environmental feedback, leading to runtime failures. To address this issue, we encode physical constraints as declarative rules in system prompts and apply Supervised Fine-Tuning (SFT) to align the model with domain dynamics. These interventions improve physical validity, but their effectiveness varies by task. This study clarifies how prompt engineering and SFT affect embodied performance, revealing both the capabilities and the persistent limitations of current open-source models.
Submission Number: 11