Keywords: Rule following, Myhill-Nerode boundary, Language Theory
Abstract: There are many instances in which we might want a language model to follow rules, including grammatical constraints and the avoidance of hateful or toxic statements. Following rules necessitates, first, keeping track of states where the rule applies, and second, following the rule when in these states. We introduce a formal framework for evaluating rule-following in language models, focusing on rules that can be expressed using deterministic finite automata (DFAs). We say a language model $(\epsilon, \delta)$-matches a DFA $A$ if (1) it behaves similarly on all prefixes that lead to the same state in the DFA (up to divergence level $\epsilon$) and (2) it distinguishes between prefixes that lead to different DFA states by exhibiting at least $\delta$ compliance over the set of suffixes that are treated differently by the automaton. To formalize this definition, we define probability distributions over the Myhill-Nerode boundary (MNB) to measure a model's ability to distinguish between valid and invalid suffixes. Our approach defines rule compliance in probabilistic terms, offering a principled way to evaluate model behavior in light of specified language-generation constraints. We implement an experimental setup in which we train transformer models on synthetic languages and evaluate the compliance of their outputs. We discuss generalizations of our framework -- including to non-regular languages -- and the implications for evaluating and controlling language models.
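The definition above can be illustrated with a minimal sketch. All names here are hypothetical and not from the paper: we use a toy DFA (even number of `a`s over the alphabet `{a, b}`) and a stand-in "language model" whose next-symbol distribution depends only on the DFA state of the prefix, so condition (1) holds exactly with $\epsilon = 0$.

```python
# Hypothetical sketch of condition (1) in the (epsilon, delta)-matching
# definition: prefixes reaching the same DFA state should induce nearby
# next-symbol distributions. The DFA, model, and function names below are
# illustrative assumptions, not the paper's implementation.

from itertools import product

# DFA over {a, b} accepting strings with an even number of 'a's.
ALPHABET = "ab"
START = 0

def step(state, sym):
    # 'a' flips the parity bit; 'b' leaves the state unchanged.
    return state ^ 1 if sym == "a" else state

def run(prefix):
    state = START
    for sym in prefix:
        state = step(state, sym)
    return state

def model_dist(prefix):
    # Toy stand-in for a language model: next-symbol probabilities
    # determined entirely by the DFA state of the prefix.
    return {"a": 0.7, "b": 0.3} if run(prefix) == 0 else {"a": 0.2, "b": 0.8}

def total_variation(p, q):
    return 0.5 * sum(abs(p[s] - q[s]) for s in ALPHABET)

def max_same_state_divergence(max_len=4):
    # Empirical epsilon: worst divergence between any two same-state prefixes.
    prefixes = ["".join(w) for n in range(max_len + 1)
                for w in product(ALPHABET, repeat=n)]
    worst = 0.0
    for u, v in product(prefixes, repeat=2):
        if run(u) == run(v):
            worst = max(worst, total_variation(model_dist(u), model_dist(v)))
    return worst

print(max_same_state_divergence())  # 0.0: the toy model is state-determined
```

Because the toy model's distribution is a function of the DFA state alone, the measured divergence is exactly zero; a real transformer trained on such a language would instead yield some empirical $\epsilon > 0$, and condition (2) would additionally be checked on suffixes straddling the Myhill-Nerode boundary.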
Submission Number: 18