Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?

ACL ARR 2025 February Submission 7945 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

Grammar serves as a cornerstone of programming languages and software engineering, providing the framework that defines the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to billions of parameters and beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models that incorporate grammar rules into the code generation process. Experiments on HumanEval(+) and MBPP(+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntactic correctness but also by improving semantic differentiation.
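To make the idea of a grammar-based code representation concrete, the sketch below linearizes a Python program into a preorder sequence of production-style rules derived from its AST. This is a minimal illustration, assuming the representation resembles a traversal of grammar-rule expansions; the function name `grammar_rule_sequence` and the exact rule format are hypothetical, and GrammarCoder's actual rule vocabulary and tokenization may differ.

```python
import ast

def grammar_rule_sequence(source: str) -> list[str]:
    """Linearize a Python program into a preorder sequence of
    grammar-production-style tokens, one per AST node expansion."""
    tree = ast.parse(source)
    rules: list[str] = []

    def visit(node: ast.AST) -> None:
        # Each node expansion becomes one "LHS -> RHS" rule token.
        children = list(ast.iter_child_nodes(node))
        rhs = " ".join(type(c).__name__ for c in children) or "<terminal>"
        rules.append(f"{type(node).__name__} -> {rhs}")
        for child in children:
            visit(child)

    visit(tree)
    return rules

if __name__ == "__main__":
    # A model consuming this sequence sees structure explicitly,
    # rather than inferring it from a flat token stream.
    for rule in grammar_rule_sequence("x = a + b"):
        print(rule)
```

Under this framing, a model generates a program by emitting grammar rules rather than surface tokens, so every output is syntactically well-formed by construction; the paper's question is whether this structural signal also helps billion-scale models semantically.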

Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding, grammar and knowledge-based approaches, pre-training
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: Python programming language
Submission Number: 7945