A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels

03 Jun 2021 (modified: 24 May 2023) · Submitted to NeurIPS 2021 Datasets and Benchmarks Track (Round 1)
Keywords: ground truth refinement, reference labels
TL;DR: If you have a function that can sometimes identify when two data points belong to the same class, and that rarely produces false positives, you can use it to bound your model's performance on unlabeled data.
Abstract: In some problem spaces, the high cost of obtaining ground truth labels necessitates the use of lower-quality reference datasets. It is difficult to benchmark model changes using these datasets, as evaluation results may be misleading or biased. We propose a supplement to reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on the precision and recall of a clustering algorithm or multiclass classifier can be computed without reference labels. We also introduce a litmus test that uses an AGTR to identify inaccurate evaluation results produced from reference datasets of dubious quality. Creating an AGTR requires domain knowledge, and malware family classification is a task with robust domain-knowledge approaches that support the construction of an AGTR. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool, diagnosing overfitting in prior testing and evaluating changes whose impact could not be meaningfully quantified under previous data.
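To make the bounding idea concrete, below is a minimal sketch of how evaluating against AGTR groups instead of true labels could yield bounds. It assumes the AGTR is a refinement of the (unknown) true labeling: every AGTR group is pure, containing points of a single true class, though one class may be split across several groups. The function name, the specific max-overlap precision/recall definitions (common in the malware clustering literature), and the toy data are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def bounds_from_agtr(pred_clusters, agtr_groups):
    """Illustrative sketch: score a predicted clustering against AGTR
    groups rather than true labels.

    Assumes both arguments partition the same n points, and that each
    AGTR group is pure (all members share one true, unknown class).
    Under that assumption, measuring against the finer AGTR groups can
    only shrink a cluster's largest single-group overlap (so the
    precision score is a lower bound on true precision) and can only
    make each group easier to cover (so the recall score is an upper
    bound on true recall).
    """
    n = sum(len(c) for c in pred_clusters)
    group_of = {x: g for g, grp in enumerate(agtr_groups) for x in grp}
    cluster_of = {x: j for j, c in enumerate(pred_clusters) for x in c}

    # Precision: each predicted cluster is credited only for its largest
    # overlap with a single AGTR group.
    precision_lb = sum(
        max(Counter(group_of[x] for x in c).values())
        for c in pred_clusters
    ) / n

    # Recall: each AGTR group is credited only for its largest overlap
    # with a single predicted cluster.
    recall_ub = sum(
        max(Counter(cluster_of[x] for x in grp).values())
        for grp in agtr_groups
    ) / n

    return precision_lb, recall_ub

# Toy usage: the predicted clustering merges two AGTR groups, so the
# precision lower bound dips below 1 while the recall upper bound is 1.
pred = [{0, 1, 2}, {3, 4, 5}]
agtr = [{0, 1}, {2}, {3, 4, 5}]
print(bounds_from_agtr(pred, agtr))  # (0.833..., 1.0)
```

Note how the bounds degrade gracefully with AGTR coverage: points the domain-knowledge function cannot match end up in singleton groups, which loosens the bounds but never invalidates them, since singletons are trivially pure.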