Asymmetric Effects of Self-Corrective Learning on Chain-of-Thought Reasoning for Efficient Policy Adaptation
Keywords: embodied agent, task adaptation
Abstract: Recent advances in language model (LM)-powered agents have demonstrated the potential to tackle complex embodied tasks by grounding the models’ commonsense world knowledge in the interactive physical environments in which the agents operate. However, adapting these LM-based agents to a stream of diverse tasks over time remains challenging, particularly under limited supervision and resource constraints. In this paper, we present BiCL, an embodied task adaptation framework that addresses the problem of continual LM finetuning across diverse tasks and adaptation stages using only a small dataset per task and a small LM (i.e., with 0.5B parameters). We devise bidirectional CoT learning, which jointly optimizes chain-of-thought (CoT) reasoning and reflexive reasoning through per-task bidirectional supervision: few-shot CoT guidance and rationale-wise correction. Few-shot CoT guidance strengthens multi-step, task-specific reasoning from minimal demonstrations, while rationale-wise correction enables the model to revise its prior rationale trajectories for new tasks. This dual optimization allows the agent to adapt more efficiently through forward knowledge transfer over time, ultimately yielding an asymmetric effect: robust CoT reasoning at inference without requiring explicit reflection. Furthermore, we implement rationale-wise test-time scaling, a mechanism that dynamically adjusts the depth of CoT reasoning based on the model’s confidence in the actions inferred from its own rationales. Through extensive experiments on VirtualHome and ALFWorld, we demonstrate performance superiority over other LM-based planning and continual task adaptation approaches, while achieving strong efficiency in computation, data usage, and model size.
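To make the bidirectional supervision concrete, the sketch below shows one plausible reading of the joint objective: a standard next-token loss on few-shot CoT demonstrations combined with a weighted loss on rationale-correction sequences. The function name, batch layout, and the weighting term `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def bidirectional_cot_loss(model, cot_batch, correction_batch, lam=1.0):
    """Hedged sketch of a joint objective over few-shot CoT guidance and
    rationale-wise correction, assuming a HuggingFace-style causal LM."""
    # Few-shot CoT guidance: next-token loss on demonstrated
    # (task, rationale, action) sequences.
    cot_logits = model(cot_batch["input_ids"]).logits
    l_cot = F.cross_entropy(
        cot_logits[:, :-1].flatten(0, 1),
        cot_batch["input_ids"][:, 1:].flatten(),
    )
    # Rationale-wise correction: next-token loss on sequences pairing the
    # model's prior rationale with a corrected rationale for the new task.
    corr_logits = model(correction_batch["input_ids"]).logits
    l_corr = F.cross_entropy(
        corr_logits[:, :-1].flatten(0, 1),
        correction_batch["input_ids"][:, 1:].flatten(),
    )
    return l_cot + lam * l_corr
```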
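The rationale-wise test-time scaling mechanism can likewise be sketched as confidence-gated expansion of the rationale: reasoning deepens one step at a time and stops as soon as the action implied by the current rationale is sufficiently confident. The helper methods `generate_rationale` and `infer_action`, and the threshold `tau`, are hypothetical names for illustration only.

```python
def rationale_wise_test_time_scaling(model, observation, tau=0.8, max_depth=4):
    """Minimal sketch: grow the CoT rationale until the model's confidence
    in the inferred action exceeds tau, up to a maximum depth."""
    rationale, action = [], None
    for _ in range(max_depth):
        # Append one more reasoning step conditioned on prior steps
        # (assumed helper; not the authors' API).
        step = model.generate_rationale(observation, rationale)
        rationale.append(step)
        # Infer a candidate action and a confidence score for it,
        # e.g., its token-level probability (assumed helper).
        action, confidence = model.infer_action(observation, rationale)
        if confidence >= tau:
            break  # confident enough: stop scaling the rationale
    return action, rationale
```

Under this reading, easy states resolve after a single reasoning step while harder ones spend more inference compute, which matches the abstract's claim of adjusting CoT depth by confidence.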
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 25026