Our Evaluation Metric Needs an Update to Encourage Generalization

30 Jan 2024 · OpenReview Archive Direct Upload
Abstract: Models that surpass human performance on several popular benchmarks degrade significantly when exposed to Out-of-Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and ‘hack’ datasets rather than learning generalizable features as humans do. To curb this inflation in model performance, and with it the overestimation of AI systems’ capabilities, we propose a simple and novel evaluation metric, the WOOD Score, which encourages generalization during evaluation.
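The abstract does not define the WOOD Score formula. As one hypothetical reading of "a metric that encourages generalization", the sketch below combines in-distribution (ID) and OOD accuracy and penalizes the gap between them; the name `wood_score`, the weight `alpha`, and the penalty form are assumptions for illustration, not the authors' definition.

```python
# Illustrative sketch only: the paper's abstract does not give the WOOD Score
# formula. We assume a generalization-aware metric that rewards OOD accuracy
# and penalizes the gap between ID and OOD performance. The function name,
# `alpha`, and the penalty term are hypothetical.

def wood_score(id_accuracy: float, ood_accuracy: float, alpha: float = 0.5) -> float:
    """Weighted ID/OOD accuracy minus an ID-overfitting penalty (assumed form)."""
    if not (0.0 <= id_accuracy <= 1.0 and 0.0 <= ood_accuracy <= 1.0):
        raise ValueError("accuracies must lie in [0, 1]")
    # A large ID/OOD gap suggests the model exploited dataset-specific biases.
    gap_penalty = max(0.0, id_accuracy - ood_accuracy)
    return alpha * ood_accuracy + (1.0 - alpha) * id_accuracy - gap_penalty


if __name__ == "__main__":
    # A benchmark-hacking model: high ID accuracy, poor OOD accuracy.
    print(wood_score(id_accuracy=0.95, ood_accuracy=0.60))  # 0.425
    # A generalizing model with lower ID accuracy but a small gap scores higher.
    print(wood_score(id_accuracy=0.85, ood_accuracy=0.80))  # 0.775
```

Under this assumed form, the second model outscores the first despite its lower ID accuracy, which matches the abstract's stated goal of rewarding generalization over benchmark hacking.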