Do Programmatic Policies Really Generalize Better? A Re-Evaluation Study

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: programmatic reinforcement learning, program synthesis
TL;DR: We revisit claims from the literature that programmatic policies generalize better than neural ones and show that much of the observed gap arises from uncontrolled experimental factors rather than intrinsic representational reasons.
Abstract: Programmatic policies are often reported to generalize better than neural policies in reinforcement learning (RL) benchmarks. We revisit some of these claims and show that much of the observed gap arises from uncontrolled experimental factors rather than intrinsic representational reasons. Re-evaluating three core benchmarks used in influential papers---TORCS, Karel, and Parking---we find that neural policies, when trained with a few modifications, such as sparse observations and cautious reward functions, can match or exceed the out-of-distribution (OOD) generalization of programmatic policies. We argue that a representation enables OOD generalization if (i) the policy space it induces includes a generalizing policy and (ii) the search algorithm can find it. The neural and programmatic policies in prior work are comparable in OOD generalization because the domain-specific languages used induce policy spaces similar to those of neural networks, and our modifications help the gradient search find generalizing solutions. However, resolving these confounds does not address the deeper question of when programmatic representations provide an inherent OOD generalization advantage over neural ones. We provide an answer to this question by focusing on problems whose solutions require working memory that grows with input size. Commonly used neural architectures cannot encode a solution to this type of problem due to their fixed-capacity design. We show that navigation tasks, such as pathfinding, and domains with nested subproblems fall into this category. As a proof of concept, we construct a modified Karel task that simple constant-memory heuristics, such as wall-following, cannot solve, and demonstrate that FunSearch can synthesize an implementation of breadth-first search that provably generalizes out of distribution.
By disentangling representational factors from experimental confounds, we advance our understanding of what makes a representation succeed or fail at OOD generalization. We hope our results will help guide the design of experiments and representations for advancing OOD generalization.
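To make the working-memory argument concrete, the following is a minimal sketch of breadth-first search on a grid maze (an illustrative stand-in, not the program FunSearch synthesizes in the paper; the grid encoding and helper names are assumptions). The point is structural: the `visited` set and `frontier` queue grow with the size of the input maze, which is exactly the unbounded working memory that a constant-memory heuristic such as wall-following lacks.

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Shortest-path length in a grid maze via BFS, or -1 if unreachable.

    grid: list of equal-length strings, '#' = wall, '.' = free cell.
    start, goal: (row, col) tuples.
    Note: both `visited` and `frontier` grow with grid size -- the
    input-dependent working memory a fixed-capacity policy cannot encode.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, 0)])  # (cell, distance from start)
    visited = {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in visited):
                visited.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return -1
```

Because BFS explores cells in order of distance, the returned length is provably optimal for any maze size, which is the sense in which such a synthesized program generalizes out of distribution by construction.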
Primary Area: reinforcement learning
Submission Number: 8892