The Next Symbol Prediction Problem: PAC-learning and its relation to Language Models

Published: 07 Nov 2023, Last Modified: 13 Dec 2023M3L 2023 PosterEveryoneRevisionsBibTeX
Keywords: PAC learning, Regular languages, language models, DFA
Abstract: The *next symbol prediction* (NSP) problem has been widely used to empirically evaluate the performance of neural sequence models on formal language tasks. We formalize the setting so as to make it amenable to PAC-learning analysis. In the NSP setting, a learning algorithm receives valid sequences (positive examples) from the underlying language, along with rich labels indicating for every prefix, whether the prefix is in the language and what symbols could appear subsequently that lead to accepting string. In the conventional classification setting where learning occurs with only positive and negative examples, the problem of learning regular languages or even subclasses represented by acyclic DFAs is known to be computationally hard based on cryptographic assumptions. In contrast, our main result shows that regular languages are efficiently PAC-learnable in the next symbol prediction setting. Further, we provide a more efficient learning algorithm for the case where the target DFA is known to be acyclic. Given the rich labels required in the NSP setting, one may wonder whether this setting is applicable to non-artificial tasks. We explain how language models can act as a source of such labeled data, and consequently, our algorithm can be applied to fit a finite-state model (DFA) that learns the (truncated) support of the language model.
Submission Number: 26