Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow

30 Jan 2024 · OpenReview Archive Direct Upload
Abstract: Recent research has shown that language models exploit ‘artifacts’ in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our user study, we observe that the created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).