Keywords: Dataset Bias, Generalization, Real-time Feedback, Crowdworker Education, Robustness, Data Quality, Visualization, Data Artifacts
TL;DR: A benchmark creation paradigm that educates data creators by providing real-time visual feedback
Abstract: Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them. Considering that this behavior inflates model performance, shouldn't the creation of better benchmarks be our priority? In pursuit of this, we focus on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. We propose VAIDA, a novel benchmark creation paradigm for NLP that provides real-time visual feedback to both crowdworkers and backend analysts on sample and dataset quality, aiming to educate them in the process. VAIDA also facilitates sample correction via recommendations to improve quality. VAIDA is domain, model, task, and metric agnostic, and constitutes a paradigm shift toward robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We demonstrate VAIDA's effectiveness by leveraging a state-of-the-art data quality metric, DQI, over four datasets. We further evaluate via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demand of crowdworkers and analysts, while simultaneously increasing the performance of both user groups.