Learning Latent Causal Semantics from Text: An Empirical Study of Next-Token Predictors Trained on Programs

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: language modeling, program synthesis, formal semantics
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: We present evidence that language models can learn to represent the semantics latent in text despite being trained only to perform next-token prediction. Specifically, we train a Transformer model on a synthetic corpus of programs written in a domain-specific language for navigating 2D grid world environments. Each program in the corpus is preceded by a (partial) specification in the form of (textual) input-output examples, and hence the semantics of the programming language enter as a \emph{latent causal variable} in the data generation process. We then probe the trained model's hidden states as it generates a program given a specification. Despite providing no inductive bias toward learning the semantics of the programming language, we find that a linear probe is able to extract abstractions of the program states from the model states, which suggests the model acquires an emergent ability to \emph{interpret} programs in the formal sense. Moreover, there is a strong, statistically significant correlation between the accuracy of the probe and the model's ability to generate a program that correctly implements the specification. To evaluate whether the semantics are represented in the model states rather than learned by the probe, we propose a causal framework for analyzing the effects of probing, and perform interventional experiments that allow us to precisely attribute the accuracy of the probe to the semantics latent in the model's training data (rather than, e.g., the signal used to supervise the probe). In summary, this paper does not propose any new techniques for training language models, but develops an empirical framework for, and provides insights into, the acquisition and representation of semantics in language models.
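The abstract's central tool is a linear probe trained to read abstractions of program state out of the Transformer's hidden states. The sketch below is a minimal illustration of that kind of setup, not the authors' code: the array names, dimensions, and the use of scikit-learn's logistic regression are assumptions chosen for clarity, and the data is a random placeholder standing in for real activations and program-state labels.

```python
# Minimal sketch of a linear-probing setup (illustrative assumptions, not the paper's code).
# `hidden_states` stands for Transformer activations collected while the model generates
# a program; `abstract_state` stands for an abstraction of the program state (e.g., the
# grid-world agent's facing direction) at the corresponding generation step.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: one hidden-state vector per generated token, with the
# abstracted program state serving as the probe's supervision signal.
n_examples, hidden_dim, n_abstract_states = 5000, 512, 4
hidden_states = rng.normal(size=(n_examples, hidden_dim))
abstract_state = rng.integers(0, n_abstract_states, size=n_examples)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, abstract_state, test_size=0.2, random_state=0
)

# A linear probe is a single linear classifier (here, multinomial logistic regression),
# so any predictive power must come from information that is already encoded
# linearly in the model's hidden states rather than computed by the probe itself.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

On real activations, the probe's held-out accuracy is the quantity the abstract correlates with the model's ability to generate programs that satisfy the specification; on the random placeholder above it should sit near chance (1/4).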
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7509