- Keywords: Information Theory, Fano’s Inequality, Difficulty Measure, Donsker-Varadhan Representation, Theory
- TL;DR: We extend Fano’s inequality to the common case of continuous-feature-discrete-label random variables, and design a neural-network based difficulty measure for AI datasets.
- Abstract: Evaluating the relative difficulty of widely-used benchmark datasets across time and across data modalities is important for accurately measuring progress in machine learning. To help tackle this problem, we proposeDIME, an information-theoretic DIfficulty MEasure for datasets, based on conditional entropy estimation of the sample-label distribution. Theoretically, we prove a model-agnostic and modality-agnostic lower bound on the 0-1 error by extending Fano’s inequality to the common supervised learning scenario where labels are discrete and features are continuous. Empirically, we estimate this lower bound using a neural network to compute DIME. DIME can be decomposed into components attributable to the data distribution and the number of samples. DIME can also compute per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME is well-aligned with empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.