Policy Learning with a Language Bottleneck

TMLR Paper 6178 Authors

12 Oct 2025 (modified: 21 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: Modern AI systems such as self-driving cars and game-playing agents achieve superhuman performance, but they often lack human-like generalization, interpretability, and interoperability with human users. This paper introduces *Policy Learning with a Language Bottleneck* (PLLB), a framework enabling AI agents to generate linguistic rules that capture the high-level strategies underlying rewarding behaviors. PLLB alternates between a *rule generation* step guided by language models and an *update* step where agents learn new policies guided by the rules. Crucially, PLLB enables this kind of language-guided learning even when a natural language rule is insufficient to completely describe the target policy. Across five diverse tasks, including a two-player signaling game, maze navigation, image reconstruction, and robot grasp planning, we show that PLLB learns more interpretable and generalizable behaviors than standard policy learning methods. In three additional human subject studies, we show that the learned rules significantly improve human task performance, enabling more effective human-AI coordination.
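The sketch below illustrates the alternation described in the abstract: rule generation by contrasting high- and low-reward episodes with an LLM, followed by a rule-guided policy update. It is not the authors' implementation; the `Episode` container, the callback signatures, the prompt wording, and the quartile-based episode split are illustrative assumptions, and `update_policy` stands in for whatever rule conditioning a given agent uses (e.g., Q-value regularization for RL agents or prompt conditioning for LLM/VLM agents).

```python
"""Minimal sketch of the PLLB alternation (illustrative, not the paper's code)."""
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Episode:
    transcript: str       # textual summary of states/actions (assumed format)
    total_reward: float


def pllb(
    policy: object,
    rollout: Callable[[object, Optional[str], int], List[Episode]],
    ask_llm: Callable[[str], str],
    update_policy: Callable[[object, List[Episode], str], object],
    num_iterations: int = 10,
    episodes_per_iter: int = 32,
) -> Tuple[object, Optional[str]]:
    """Alternate LLM rule generation and rule-guided policy updates."""
    rule: Optional[str] = None
    for _ in range(num_iterations):
        # Collect experience with the current, possibly rule-conditioned policy.
        episodes = rollout(policy, rule, episodes_per_iter)

        # Rule generation: contrast high- and low-reward episodes and ask the
        # LLM for a short natural-language rule explaining the difference.
        episodes = sorted(episodes, key=lambda e: e.total_reward)
        k = max(1, episodes_per_iter // 4)
        low, high = episodes[:k], episodes[-k:]
        prompt = (
            "High-reward episodes:\n"
            + "\n".join(e.transcript for e in high)
            + "\n\nLow-reward episodes:\n"
            + "\n".join(e.transcript for e in low)
            + "\n\nWrite one short rule describing the rewarding strategy."
        )
        rule = ask_llm(prompt)

        # Update step: learn a new policy guided by the rule (e.g., regularize
        # an RL agent toward rule-consistent actions, or condition an LLM/VLM
        # agent's prompt on the rule).
        policy = update_policy(policy, episodes, rule)

    return policy, rule
```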
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

## Summary of Revisions (by Section)

### Introduction
- Clarified the core unifying idea behind PLLB: contrastive reward-based rule extraction using an LLM, followed by iterative conditioning of policy updates on rules.
- Explicitly stated that the generalization experiments focus on few-shot generalization.

### Section 3: Method
- Added a paragraph clarifying the pitch and unifying mechanism of PLLB.
- Explained how rule conditioning differs across agents (e.g., Q-regularization vs. prompt conditioning).
- Revised the description of the update step to clarify how PLLB accommodates different learning agents (e.g., RL policies, VLMs/LLMs).

### Section 6: Experiments
- Updated the discussion to better contextualize the few-shot generalization setting and experimental goals.

### Section 8: Generalization Analysis
- Clarified that generalization results focus on few-shot generalization rather than zero-shot or out-of-distribution generalization.

### Section 9: Limitations and Discussion
- Expanded discussion to address reviewer concerns, including:
  - Non-linguistic abstractions
  - Reward misspecification
  - Stochastic environments
  - Scalability to complex domains
  - Risks of flawed rule generation in safety-critical settings

### Figure 2
- Updated Figure 2 to improve clarity on the diversity of environments studied.

### Appendix B.10: Rule Quality and Robustness
- Added sensitivity analysis on prompt variation and sampling temperature for rule generation.

### Appendix B.11: Contrastive Ablation
- Added a non-contrastive ablation showing that using only high-reward episodes leads to lower performance, motivating the contrastive design of PLLB.

### Minor Revisions
- Made additional small clarifications and edits throughout the paper, as detailed in the individual reviewer responses.
Assigned Action Editor: ~Erin_J_Talvitie1
Submission Number: 6178