Abstract: Evaluating machine learning models is important yet challenging in many real-world scenarios. Traditional evaluation is dataset-driven: models are evaluated on predefined benchmark datasets. However, these datasets can only cover a limited scope, leaving unanticipated inputs untested and weaknesses of the model unrevealed. To overcome this problem, we propose OmniInput, a novel approach that evaluates models comprehensively over an input space (i.e., internet-scale data). Our method entails efficient sampling of inputs from the model, estimation of the corresponding output distribution, and an innovative way to compute the model’s precision-recall curve from that output distribution with only modest human annotation effort. In our experiments, we first validate the correctness of OmniInput within a small input space where brute-force enumeration is still possible. We then show that OmniInput can quantitatively evaluate more complex models, such as language models (various versions of GPT-2, OLMo, and DistilBERT) and computer vision models, and analyze interesting patterns in an input space.
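The pipeline sketched in the abstract (sample inputs from the model, estimate the output distribution, then derive precision and recall from a small annotation budget) can be illustrated with a minimal, hypothetical sketch. This is not the authors' implementation: the names `bin_mass` and `bin_precision` are assumptions, and the estimator simply aggregates per-output-bin mass and annotated per-bin precision into a precision-recall curve over output thresholds.

```python
import numpy as np

def precision_recall_curve(bin_mass: np.ndarray, bin_precision: np.ndarray):
    """Estimate a precision-recall curve over the whole input space.

    bin_mass[i]      : estimated fraction of the input space whose model
                       output falls in bin i (bins ordered by output score).
    bin_precision[i] : fraction of human-annotated samples in bin i that
                       were judged to be true positives.
    Returns precision and recall for a threshold placed at each bin boundary.
    """
    # Estimated mass of true positives contributed by each output bin.
    positive_mass = bin_mass * bin_precision
    total_positive = positive_mass.sum()

    precisions, recalls = [], []
    for t in range(len(bin_mass)):          # threshold keeps bins t..end
        kept_mass = bin_mass[t:].sum()
        kept_positive = positive_mass[t:].sum()
        precisions.append(kept_positive / kept_mass if kept_mass > 0 else 1.0)
        recalls.append(kept_positive / total_positive if total_positive > 0 else 0.0)
    return np.array(precisions), np.array(recalls)

# Toy usage: 5 output bins; higher-scoring bins are smaller but more precise.
mass = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
prec = np.array([0.00, 0.10, 0.40, 0.80, 0.95])
p, r = precision_recall_curve(mass, prec)
print(np.round(p, 3), np.round(r, 3))
```

In this toy setting, only a handful of annotations per output bin are needed to populate `bin_precision`, which is what makes the overall annotation effort modest even though the underlying input space is enormous.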
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Adina_Williams1
Submission Number: 4040