DIME: AN INFORMATION-THEORETIC DIFFICULTY MEASURE FOR AI DATASETS

Peiliang Zhang; Huan Wang; Nikhil Naik; Caiming Xiong; Richard Socher

25 Sept 2019 (modified: 05 May 2023), ICLR 2020 Conference Blind Submission, Readers: Everyone

Keywords: Information Theory, Fano’s Inequality, Difficulty Measure, Donsker-Varadhan Representation, Theory

TL;DR: We extend Fano’s inequality to the common case of continuous-feature, discrete-label random variables, and design a neural-network-based difficulty measure for AI datasets.

Abstract: Evaluating the relative difficulty of widely used benchmark datasets across time and across data modalities is important for accurately measuring progress in machine learning. To help tackle this problem, we propose DIME, an information-theoretic DIfficulty MEasure for datasets, based on conditional entropy estimation of the sample-label distribution. Theoretically, we prove a model-agnostic and modality-agnostic lower bound on the 0-1 error by extending Fano’s inequality to the common supervised learning scenario where labels are discrete and features are continuous. Empirically, we estimate this lower bound using a neural network to compute DIME. DIME can be decomposed into components attributable to the data distribution and the number of samples. DIME can also compute per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME is well-aligned with the empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.
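
To make the idea of an entropy-based lower bound on the 0-1 error concrete, here is a minimal sketch of the classical, discrete form of Fano’s inequality in its weakened version, P_e >= (H(Y|X) - log 2) / log(K - 1) for K > 2 classes. It is not the paper’s continuous-feature extension and not its neural estimator: the function name `fano_lower_bound` and its parameters are hypothetical, and the conditional entropy H(Y|X), in nats, is assumed to be supplied by some separate estimator.

```python
import math


def fano_lower_bound(cond_entropy_nats: float, num_classes: int) -> float:
    """Weakened classical Fano bound on the 0-1 error of any classifier.

    Assumes discrete labels with num_classes > 2 and a conditional
    entropy H(Y|X) in nats that was estimated elsewhere (hypothetical
    input; the abstract's own estimator is not reproduced here).
    """
    if num_classes < 3:
        raise ValueError("this weakened form needs at least 3 classes")
    if cond_entropy_nats < 0:
        raise ValueError("conditional entropy of a discrete label is non-negative")
    # Fano: H(Y|X) <= H_b(P_e) + P_e * log(K - 1), and H_b(P_e) <= log 2.
    bound = (cond_entropy_nats - math.log(2)) / math.log(num_classes - 1)
    # A negative value carries no information, so clamp it to zero.
    return max(0.0, bound)


# Labels that the features say nothing about: H(Y|X) = log K.
uninformative = fano_lower_bound(math.log(10), 10)
# Labels fully determined by the features: H(Y|X) = 0.
deterministic = fano_lower_bound(0.0, 10)
```

With 10 equally likely classes and uninformative features, the bound evaluates to log 5 / log 9, which sits below the true best-achievable error of 0.9, as a lower bound should. When the features determine the label, the bound collapses to 0.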
