Keywords: Kolmogorov structure function, usable information, computational constraints
TL;DR: We propose a scalable diagnostic tool for datasets based on the Kolmogorov structure function
Abstract: The degree to which a task is learnable given different computational constraints shows the amount of usable information at different scales. An instantiation of this idea is the \textit{Kolmogorov Structure Function} (KSF), which shows how the fit of an optimal $k$-bit description of a given string improves for increasing values of $k$. While conceptually appealing, computing the KSF is infeasible in practice due to the exponentially large search space of all descriptions of a given length, in addition to the unbounded time complexity. This paper proposes the Constrained Structure Function (CSF), a generalization of the KSF that can be computed efficiently by taking into account realistic computational constraints. In addition to being feasible to compute, the CSF of a dataset can be expressed as the sum of datapoint-wise functions which reflect the degree to which each datapoint is typical in the context of the dataset. Empirically, we demonstrate that the CSF can be used for detecting individual datapoints with characteristics such as being easy, mislabeled, or belonging to a hidden subgroup.
In-person Presentation: yes
0 Replies
Loading