Abstract: Current approaches to testing and debugging NLP models rely on highly variable human creativity and extensive labor, or only work for a very restrictive class of bugs. We present AdaTest, a process for adaptive testing and debugging of NLP models inspired by the test-debug cycle in traditional software engineering. AdaTest encourages a partnership between the user and a large language model (LM): the LM proposes tests that are validated and organized by the user, who in turn gives feedback and steers the LM towards better tests. Once enough bugs are discovered, they are fixed (e.g., by finetuning), and the user resumes testing. In experiments with expert and non-expert users and commercial and research models across 8 different tasks, AdaTest makes users 5-10x more effective at finding bugs than current approaches, and helps users fix bugs effectively without introducing new ones.
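For readers who prefer pseudocode, the sketch below illustrates the kind of test-debug loop the abstract describes. It is a hedged sketch, not the paper's implementation: the names `propose_tests`, `review_tests`, and `fix_bugs` are hypothetical placeholders for the LM proposal step, the user validation step, and the model-repair step (e.g., finetuning on confirmed failures).

```python
# Minimal sketch of an AdaTest-style adaptive test-debug loop.
# All callables here are hypothetical placeholders, not the actual AdaTest API.
from typing import Callable, List, Tuple


def adaptive_test_debug_loop(
    model,
    propose_tests: Callable[[List[str]], List[str]],                  # LM proposes tests seeded on prior ones
    review_tests: Callable[[object, List[str]], Tuple[List[str], List[str]]],  # user validates proposals
    fix_bugs: Callable[[object, List[str]], object],                  # e.g., finetune on confirmed failures
    seed_tests: List[str],
    rounds: int = 5,
    bug_threshold: int = 20,                                          # illustrative choice, not from the paper
):
    """Alternate LM-driven test generation with user-guided debugging."""
    tests = list(seed_tests)
    confirmed_bugs: List[str] = []

    for _ in range(rounds):
        # Testing loop: the LM proposes candidate tests; the user keeps the
        # valid ones and flags those the model actually fails, steering the
        # next round of proposals toward better tests.
        proposed = propose_tests(tests)
        kept, bugs = review_tests(model, proposed)
        tests.extend(kept)
        confirmed_bugs.extend(bugs)

        # Debugging loop: once enough bugs accumulate, fix them and resume
        # testing against the updated model.
        if len(confirmed_bugs) >= bug_threshold:
            model = fix_bugs(model, confirmed_bugs)
            confirmed_bugs = []

    return model, tests
```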