When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems
Abstract: In natural language understanding (NLU) production systems, end users' evolving needs necessitate the addition of new abilities, indexed by discrete symbols, requiring additional training data and resulting in dynamic, ever-growing datasets. Dataset growth introduces new challenges: we find that when learning to map inputs to a new symbol from a fixed number of annotations, more data can in fact \emph{reduce} the model's performance on examples that involve this new symbol. We show that this trend holds for multiple models on two datasets for common NLU tasks: intent recognition and semantic parsing. We demonstrate that the performance decrease is largely associated with an effect we refer to as source signal dilution, which occurs when strong lexical cues in the training data become diluted as the dataset grows. Selectively dropping training examples to prevent source dilution often reverses the performance decrease, suggesting a direction for improving models. We release our code and models at \url{anonymous-link}.