- Keywords: Dataset understanding, Difficulty Measure, Information Theory, Fano's Inequality, Conditional Entropy
- TL;DR: We design DIME, an empirical difficulty measure that characterizes the intrinsic complexity of a dataset's sample-label distribution in supervised learning.
- Abstract: Evaluating the relative difficulty of widely used benchmark datasets, across time and across data modalities, is important for accurately measuring progress in machine learning. To help tackle this problem, we propose DIME, an information-theoretic DIfficulty MEasure for datasets, based on Fano’s inequality and a neural-network estimate of the conditional entropy of the sample-label distribution. DIME decomposes into components attributable to the data distribution and to the number of samples, and it also yields per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME aligns well with the empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.
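As context for the abstract, the standard form of Fano's inequality (a textbook result, not wording taken from this paper) links the conditional entropy $H(Y \mid X)$ of labels $Y \in \mathcal{Y}$ given samples $X$ to the error probability $P_e$ of any predictor of $Y$ from $X$:

```latex
% Fano's inequality (logs base 2; H_b is the binary entropy function):
\[
  H(Y \mid X) \;\le\; H_b(P_e) \;+\; P_e \log_2\bigl(|\mathcal{Y}| - 1\bigr).
\]
% Since H_b(P_e) <= 1 bit, rearranging gives a lower bound on the error:
\[
  P_e \;\ge\; \frac{H(Y \mid X) - 1}{\log_2\bigl(|\mathcal{Y}| - 1\bigr)}.
\]
```

Higher conditional entropy therefore forces a higher minimum achievable error for every model, which is the sense in which an estimate of $H(Y \mid X)$ can serve as a dataset difficulty measure; the specific neural-network estimator DIME uses is not detailed in the abstract.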