TL;DR: We fix bugs in models after training using patches expressed as natural language strings
Abstract: The de facto standard for fixing bugs in models after training is to finetune the model on additional annotated data, or to patch it with brittle if-else rules. In contrast, humans often use natural language to give each other corrective feedback. In this work, we explore using natural language patches from users to fix bugs in NLP models. Our approach uses a gating head to softly combine the original model output with a patch-conditioned output from an interpreter head. Both heads are trained by inserting a patch finetuning stage between training and deployment, with a training objective based on synthetically generated inputs and patches. Surprisingly, we show that this synthetic patch training phase is enough to enable patching on real data: on two data slices from a sentiment analysis dataset, 1 to 5 language patches improve performance by ~1-4%. Next, on an adversarial relation extraction diagnostic test set, we improve F1 by over 30% with just 6 patches.
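The soft combination described in the abstract can be sketched as a convex mixture: a gate decides how applicable the patch is to the input, and interpolates between the base head and the patch-conditioned interpreter head. The following is a minimal illustrative sketch, not the paper's implementation; the function names, the scalar gate, and the example logits are all assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector
    z = np.exp(logits - logits.max())
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def patched_output(base_logits, patch_logits, gate_logit):
    """Softly combine the original head with the patch-conditioned head.

    base_logits  -- logits from the original model head
    patch_logits -- logits from the interpreter head, conditioned on the patch
    gate_logit   -- scalar score from the gating head (hypothetical scalar form)
    """
    g = sigmoid(gate_logit)          # gate in [0, 1]: patch applicability
    p_base = softmax(base_logits)    # original prediction
    p_patch = softmax(patch_logits)  # patch-conditioned prediction
    return g * p_patch + (1.0 - g) * p_base

# A strongly applicable patch (large gate_logit) pulls the prediction
# toward the interpreter head, flipping the base model's label here.
probs = patched_output(np.array([2.0, 0.0]), np.array([0.0, 2.0]), 4.0)
```

Because the gate is soft, an irrelevant patch (gate near 0) leaves the original behavior essentially untouched, which is what allows several patches to coexist without breaking inputs they do not apply to.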
Track: Non-Archival (will not appear in proceedings)
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2211.03318/code)