Teaching Code Execution to Tiny Language Models

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Code Language Models, Tiny Language Models, Code Execution
TL;DR: TEX, a 15M parameter language model, is trained on random code snippets in a custom Turing-complete language using next-token prediction. It achieves 99.13% accuracy in code execution, demonstrating that tiny language models can learn this task.
Abstract: Recent advances in large language models have demonstrated their effectiveness across a wide range of tasks. However, the limitations of these models remain an open question. For instance, can a language model learn to perform code execution, i.e., to predict the output of a program? Current research indicates that the performance of state-of-the-art large language models on code execution is still limited, and the reasons for this limitation are unclear. Is it due to fundamental constraints, or to other factors such as training data and computational resources? Is the next-token prediction objective sufficient for learning code execution? How small can a language model be while still being capable of learning it? In this paper, we investigate these questions. More specifically, we examine whether tiny language models, trained from scratch with the next-token prediction objective, can effectively learn to execute code. Our experiments show that, given appropriate data, model size, and computational resources, tiny language models can indeed learn code execution, reaching 99.13% accuracy on a tiny Turing-complete programming language. We begin by defining this language, which we call TinyPy. Millions of randomly generated programs in TinyPy, paired with their outputs, are used to train our tiny language models on the next-token prediction task. We then conduct a series of experiments to determine the smallest model size, amount of data, and computational budget needed to reach near-perfect accuracy in code execution. Our findings reveal that TEX, our proposed tiny language model with 15M parameters, successfully learns code execution. This suggests that a task as complex as predicting code output is within the reach of language models.
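The data-construction recipe described in the abstract (randomly generated programs paired with their outputs, serialized into single strings for next-token prediction) can be sketched roughly as follows. This is a hypothetical illustration only: the paper's actual TinyPy grammar, program generator, and code/output delimiter are not specified here, so the `random_program` and `make_example` helpers and the `# output:` marker below are assumptions.

```python
import random

def random_program(rng, n_stmts=3):
    """Generate a random straight-line program in a TinyPy-like subset
    (integer assignments, arithmetic, and a final print).
    Hypothetical sketch; not the paper's actual grammar."""
    names = ["a", "b", "c"]
    lines, defined = [], []
    for _ in range(n_stmts):
        target = rng.choice(names)
        if defined and rng.random() < 0.5:
            # Reuse a previously assigned variable in an expression.
            expr = f"{rng.choice(defined)} {rng.choice(['+', '-', '*'])} {rng.randint(1, 9)}"
        else:
            expr = str(rng.randint(1, 9))
        lines.append(f"{target} = {expr}")
        defined.append(target)
    lines.append(f"print({rng.choice(defined)})")
    return "\n".join(lines)

def make_example(rng):
    """Pair a program with its executed output as one training string,
    so a model trained on next-token prediction must predict the output."""
    code = random_program(rng)
    captured = []
    # Capture print() output instead of writing to stdout.
    exec(code, {"print": lambda *a: captured.append(" ".join(map(str, a)))}, {})
    return code + "\n# output:\n" + "\n".join(captured)

if __name__ == "__main__":
    rng = random.Random(0)
    print(make_example(rng))
```

Generating millions of such (program, output) strings and training on them with a plain language-modeling loss is, at this level of abstraction, all that the next-token prediction setup requires: the model sees the output tokens only after the code and the delimiter, so predicting them amounts to executing the program.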
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4371
