Keywords: Large Language Models Evaluation, Formal Languages, In-context Learning, Symbolic Reasoning, Systematic Generalization
Abstract: As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: *given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs?* We introduce RoboGrid, a framework that disentangles **syntax**, **behavior**, and **semantics** through controlled stress tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Although chain-of-thought (CoT) reasoning provides partial mitigation, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, `Alien` lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in the hierarchical state-tracking required for reliable, grammar-agnostic agents.
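The syntax axis of the evaluation described above can be illustrated with a minimal, hypothetical sketch (not the RoboGrid implementation): a toy context-free grammar `S -> "(" S ")" | "x"` whose strings can be generated at a controlled recursion depth and checked for validity with a recursive-descent recognizer.

```python
# Hypothetical sketch of syntax-level checking under a toy CFG:
#   S -> "(" S ")" | "x"
# Grammar, function names, and depth parameter are illustrative assumptions,
# not the paper's actual benchmark.

def generate(depth: int) -> str:
    """Produce the unique valid string at a given nesting depth."""
    return "(" * depth + "x" + ")" * depth

def is_valid(s: str) -> bool:
    """Recursive-descent recognizer: accept iff s derives from S."""
    def parse(i: int) -> int:
        # Try the terminal production S -> "x".
        if i < len(s) and s[i] == "x":
            return i + 1
        # Try the recursive production S -> "(" S ")".
        if i < len(s) and s[i] == "(":
            j = parse(i + 1)
            if j != -1 and j < len(s) and s[j] == ")":
                return j + 1
        return -1
    return parse(0) == len(s)
```

Stress-testing recursion depth then amounts to asking a model for strings at increasing `depth` and measuring how often `is_valid` accepts its output.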
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Hierarchical structure prediction, Syntax-to-semantic interface, In-context learning, Inductive reasoning, Symbolic reasoning, Generalization
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8341