QualEval: Qualitative Evaluation for Model Improvement

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: qualitative evaluation, evaluation, framework
TL;DR: We propose QualEval, the first qualitative evaluation framework for LLMs that automatically provides actionable natural-language insights to improve model performance.
Abstract: Quantitative evaluation metrics have played a central role in measuring the progress of natural language processing (NLP) systems such as large language models (LLMs) thus far, but they come with their own weaknesses. Given the complex and intricate nature of real-world tasks, a single scalar that quantifies and compares models is a gross trivialization of model behavior that ignores its idiosyncrasies. As a result, scalar evaluation metrics like accuracy make the model improvement process an arduous one: it currently involves substantial manual effort, including analyzing a large number of data points and making hit-or-miss changes to the training data or setup. This process is even more excruciating when the analysis must be performed over a cross-product of multiple models and datasets. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which enables automated qualitative evaluation as a vehicle for model improvement. QualEval produces a comprehensive dashboard with fine-grained analysis and human-readable insights for improving the model. We show that utilizing the dashboard generated by QualEval improves performance by up to 12% (relative) on a variety of datasets, leading to agile model development cycles on both open-source and closed-source models and across setups such as fine-tuning and in-context learning. In essence, QualEval serves as an automated data-scientist-in-a-box. Given its focus on critiquing and improving current evaluation metrics, our method serves as a refreshingly new approach to both model evaluation and improvement.
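To make the workflow described in the abstract concrete, below is a minimal, hypothetical sketch of a QualEval-style loop in Python: an LLM proposes sub-skills for a task, each example is attributed to one sub-skill, per-skill accuracy forms the "dashboard," and low-scoring skills are turned into natural-language, actionable insights. All names here (`call_llm`, `discover_attributes`, the prompts, and the thresholds) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a QualEval-style qualitative evaluation loop.
# `call_llm` stands in for any chat-completion API; prompts, category names,
# and thresholds are illustrative assumptions, not the paper's implementation.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def discover_attributes(call_llm: Callable[[str], str], examples: List[dict], k: int = 5) -> List[str]:
    """Ask an LLM to propose k sub-skills/domains that characterize the task."""
    sample = "\n".join(e["input"][:200] for e in examples[:20])
    prompt = (f"List {k} distinct skills or domains needed to solve tasks like:\n{sample}\n"
              "Return one short name per line.")
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()][:k]


def assign_attribute(call_llm: Callable[[str], str], example: dict, attributes: List[str]) -> str:
    """Ask the LLM which single attribute an example exercises most."""
    prompt = (f"Task input:\n{example['input']}\n\n"
              f"Which ONE of these skills does it test: {', '.join(attributes)}?\n"
              "Answer with the skill name only.")
    answer = call_llm(prompt).strip()
    return answer if answer in attributes else attributes[0]


def build_dashboard(call_llm: Callable[[str], str],
                    examples: List[dict],
                    is_correct: Callable[[dict], bool]) -> Dict[str, Tuple[int, int]]:
    """Aggregate correctness per discovered attribute: {attribute: (n_correct, n_total)}."""
    attributes = discover_attributes(call_llm, examples)
    stats: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for ex in examples:
        attr = assign_attribute(call_llm, ex, attributes)
        stats[attr][0] += int(is_correct(ex))
        stats[attr][1] += 1
    return {a: (c, n) for a, (c, n) in stats.items()}


def insights(dashboard: Dict[str, Tuple[int, int]], floor: float = 0.5) -> List[str]:
    """Turn per-attribute scores into human-readable, actionable notes."""
    notes = []
    for attr, (correct, total) in sorted(dashboard.items(),
                                         key=lambda kv: kv[1][0] / max(kv[1][1], 1)):
        acc = correct / max(total, 1)
        if acc < floor:
            notes.append(f"Model is weak on '{attr}' ({acc:.0%} over {total} examples); "
                         f"add targeted demonstrations or fine-tuning data for this skill.")
    return notes
```

Under this sketch, the returned notes would be the kind of insight one could feed back into prompt selection or fine-tuning data curation, which is the role the abstract attributes to the dashboard.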
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7901