Keywords: Embodied agents, Large language models, Evaluator-guided distillation, Goal interpretation, Subgoal decomposition, Action sequencing, Transition modeling, VirtualHome, BEHAVIOR
TL;DR: We show that with evaluator-guided LLM distillation and scaffolding, small finetuned Qwen3 models can match or beat frontier LLMs on the Embodied Agent Interface benchmark.
Abstract: In this report, we present the winning submission to the NeurIPS 2025 Embodied Agent Interface (EAI) challenge. We treat each module (Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling) as a supervised sequence-to-sequence task and finetune small Qwen3 models on data constructed from the public EAI dataset using LLM-driven, evaluator-in-the-loop refinement. On BEHAVIOR, these task-specialized Qwen3 models nearly saturate all four modules and outperform a strong GPT-5-mini baseline; on VirtualHome, they substantially improve performance across the board. We also introduce a learned LLM evaluator, a finetuned Qwen3 model that scores and refines candidate outputs, which, when combined with retrieval and simple voting, yields further gains. Overall, our results show that careful prompt design and evaluator-guided distillation can allow smaller open-source models to match or outperform frontier LLMs on embodied reasoning tasks.
Submission Number: 9