Track: long paper (up to 10 pages)
Keywords: LLM, scientific reasoning, out of distribution
Abstract: CFLBench evaluates whether tool-augmented language models can acquire the execution semantics of \emph{previously unseen programming languages} from a handful of examples and black-box execution feedback. We construct 19 small deterministic languages that share a compact, assembly-like surface vocabulary but vary the underlying control-flow mechanism. Each language provides four tasks: Tasks 1--2 emphasize direct generalization from examples and are defined as \textbf{passive}, while Tasks 3--4 are designed to benefit from active experimentation, via probe programs and iterative hypothesis refinement, and are defined as \textbf{active}. Across a range of state-of-the-art models, we observe a consistent gap between passive and active tasks. The strongest model we evaluate, GPT 5.2, achieves 94.7\% on passive tasks but drops to 60.5\% on active tasks; other models, such as Claude Opus 4.5 (94.7\%$\to$42.1\%) and Gemini-3-Pro (84.2\%$\to$31.6\%), degrade more sharply. This pattern suggests that learning new execution semantics through interaction is a distinct bottleneck, beyond simply generalizing semantic patterns from examples. We release the code, tasks, and interpreters at \url{https://anonymous.4open.science/r/CFLBench-C01E/} to support fine-grained behavioral analysis of inductive strategies and failure modes.
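To make the passive/active distinction concrete, the following is a minimal sketch of the active-task interaction loop the abstract describes, assuming a black-box interpreter exposed as a callable; all names here (`run_active_task`, `propose_probe`, `answer`, `max_probes`) are illustrative and are not taken from the released code.

```python
# Hypothetical sketch of the active-task loop: the model may submit
# probe programs to a black-box interpreter before committing to an
# answer. All names are illustrative, not the benchmark's actual API.

from typing import Callable, Optional

def run_active_task(
    interpreter: Callable[[str], str],  # black box: program text -> output
    propose_probe: Callable[[list[tuple[str, str]]], Optional[str]],
    answer: Callable[[list[tuple[str, str]]], str],
    max_probes: int = 10,
) -> str:
    """Iteratively refine a hypothesis about the unseen language's
    execution semantics via execution feedback, then answer."""
    history: list[tuple[str, str]] = []  # (probe program, observed output)
    for _ in range(max_probes):
        probe = propose_probe(history)
        if probe is None:  # model is confident; stop probing early
            break
        history.append((probe, interpreter(probe)))
    return answer(history)
```

Under this framing, a passive task corresponds to calling `answer` on the fixed set of worked examples alone, with no probing budget.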
Presenter: ~Aaroosh_Rustagi1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 103