Research Area: Alignment, Data, Evaluation
Keywords: data checklist, usable information, dataset artifact, preference alignment
TL;DR: We propose a data checklist consisting of 10 unit tests that diagnose dataset artifacts and ground model behavior in the data models are trained on.
Abstract: Model checklists (Ribeiro et al., 2020) have emerged as a useful tool for understanding the behavior of LLMs, analogous to unit-testing in software engineering. However, despite datasets being a key determinant of model behavior, evaluating datasets -- e.g., for the existence of annotation artifacts -- is largely done ad hoc, once a problem in model behavior has already been found downstream.
In this work, we take a more principled approach to unit-testing datasets by proposing a taxonomy of dataset unit tests grounded in the $\mathcal{V}$-information literature. We call a collection of such unit tests a data checklist.
Using the checklist, we not only recover known artifacts in well-studied datasets such as SNLI, but also discover previously unknown artifacts in preference datasets for LLM alignment.
Data checklists further enable a new kind of data filtering, which we use to improve the efficacy and data efficiency of preference alignment.
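For reference, the $\mathcal{V}$-information quantities that the proposed taxonomy builds on are standard in that literature (following Xu et al., 2020); the notation below comes from that prior work, not from this submission:
$$H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}\left[-\log f[\varnothing](y)\right], \qquad H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y}\left[-\log f[x](y)\right],$$
$$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X),$$
where $\mathcal{V}$ is a predictive family of models $f$ mapping an input (or the null input $\varnothing$) to a distribution over labels. Intuitively, $I_{\mathcal{V}}(X \to Y)$ measures how much usable information the input $X$ provides about the label $Y$ to models in $\mathcal{V}$.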
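As an illustration of the filtering idea, the following is a minimal, hypothetical sketch (not the submission's released code) of filtering preference examples by pointwise V-information (PVI), assuming per-example log-likelihoods have been precomputed from two finetuned models, one conditioned on the prompt and one on a null prompt; all names below (Example, pvi, filter_by_pvi) are illustrative.

# Hypothetical sketch: PVI-based filtering of preference data.
# Assumes precomputed per-example log-likelihoods:
#   logp_with_input -- log p(chosen | prompt) under a model finetuned on full (prompt, response) pairs
#   logp_null_input -- log p(chosen | null prompt) under a model finetuned with prompts blanked out
# PVI(x -> y) = log p(y | x) - log p(y | null); low PVI suggests the prompt adds
# little usable information about which response is preferred.

from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    chosen: str
    rejected: str
    logp_with_input: float   # log-likelihood of the chosen response given the prompt
    logp_null_input: float   # log-likelihood of the chosen response given a null prompt


def pvi(example: Example) -> float:
    """Pointwise V-information of the prompt about the preferred response."""
    return example.logp_with_input - example.logp_null_input


def filter_by_pvi(examples: list[Example], threshold: float = 0.0) -> list[Example]:
    """Keep examples whose prompt contributes usable information (PVI above threshold)."""
    return [ex for ex in examples if pvi(ex) > threshold]


if __name__ == "__main__":
    data = [
        Example("Explain recursion.", "Recursion is when ...", "No idea.", -12.3, -15.8),
        Example("Hi", "Hello!", "Hey!", -4.1, -4.0),  # prompt adds almost no information
    ]
    kept = filter_by_pvi(data)
    print(f"kept {len(kept)} of {len(data)} examples")

In practice the threshold and the choice of which partial input to blank out (prompt, chosen response, or rejected response) would depend on which artifact a given unit test targets.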
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1131