Keywords: Embodied agents, Large language models, Evaluator-guided distillation, Goal interpretation, Subgoal decomposition, Action sequencing, Transition modeling, VirtualHome, BEHAVIOR
TL;DR: We show that with evaluator-guided LLM distillation and scaffolding, small finetuned Qwen3 models can match or beat frontier LLMs on the Embodied Agent Interface benchmark.
Abstract: In this report, we present the winning submission to the NeurIPS 2025 Embodied Agent Interface (EAI) challenge. We treat each module (Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling) as a supervised sequence-to-sequence task and finetune small Qwen3 models on data constructed from the public EAI dataset using LLM-driven, evaluator-in-the-loop refinement. On BEHAVIOR, these task-specialized Qwen3 models nearly saturate all four modules and outperform a strong GPT-5-mini baseline; on VirtualHome, they substantially improve performance across the board. We also introduce a learned LLM evaluator, a finetuned Qwen3 model that scores and refines candidate outputs, which, when combined with retrieval and simple voting, yields further gains. Overall, our results show that careful prompt design and evaluator-guided distillation can allow smaller open-source models to match or outperform frontier LLMs on embodied reasoning tasks.
Submission Number: 9