Keywords: latent variable learning, code generation, reasoning, synthetic data generation
TL;DR: We present a method to model observed text data with a learned latent space of code programs.
Abstract: Modern language modeling datasets require models to handle system-2 compositional reasoning, fact recall, and task-specific constraints. While these tasks are expressed in natural language, they often imply an underlying symbolic representation. In this work, we consider methods for extracting such a latent symbolic representation in an unsupervised manner.
We introduce a latent variable modeling approach that treats observed data as being generated from a latent generative representation: an executable code program. Using code as the latent symbolic representation offers two key advantages.
First, code offers a structured space that can be explored via modular functions; second, code is interpretable and executable by both deterministic and neural interpreters, enabling compositional and programmatic decoding into text. By identifying and composing patterns in this latent space, we can sample programs that produce correct, diverse, and task-relevant text through program execution.
We demonstrate how our method induces a latent space with modern LLMs, explore patterns discovered within it, and evaluate text data synthesized from our induced latent space.
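To make the latent-program view concrete, here is a minimal, hypothetical sketch (not the paper's implementation; all function names are illustrative assumptions): a toy "latent space" of modular functions composes an executable program, and a deterministic interpreter executes it to decode text.

```python
import io
import random
import contextlib

# Hypothetical modular components of a latent program space;
# these stand in for patterns an induction method might discover.
def pick_entity() -> str:
    return random.choice(["Alice", "Bob"])

def pick_count() -> int:
    return random.randint(2, 5)

def make_program(entity: str, count: int) -> str:
    # A latent "program": executable code whose output is one observed text sample.
    return (
        f"print(f'{entity} bought {count} apples, then {count} more: '"
        f" + str({count} + {count}))"
    )

def execute(program: str) -> str:
    # Deterministic interpreter: run the program and capture its stdout as text.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(program)
    return buf.getvalue().strip()

# Sample a program from the toy latent space and decode it into text.
prog = make_program(pick_entity(), pick_count())
print(execute(prog))  # e.g. "Alice bought 3 apples, then 3 more: 6"
```

Because the text is produced by executing a program rather than sampled free-form, arithmetic facts in the output are correct by construction, which is the property the abstract attributes to program execution.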
Submission Number: 61