Abstract: Modern data is messy and high-dimensional, and
it is often unclear a priori which questions are the right ones
to ask. Instead, the analyst typically needs to use the data to
search for interesting analyses to perform and hypotheses to
test. This is an adaptive process: the choice of which analysis
to perform next depends on the results of previous analyses
of the same data. Ultimately, which results are reported
can be heavily influenced by the data. It is widely recognized
that this process, even if well-intentioned, can lead to biases
and false discoveries, contributing to the crisis of reproducibility
in science. But while any data exploration renders standard
statistical theory invalid, experience suggests that different types
of exploratory analysis can lead to disparate levels of bias,
and the degree of bias also depends on the particulars of the
data set. In this paper, we propose a general information usage
framework to quantify and provably bound the bias and other
error metrics of an arbitrary exploratory analysis. We prove that
our mutual-information-based bound is tight in natural settings,
and then use it to give rigorous insights into when commonly used
procedures do or do not lead to substantially biased estimation.
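To make the flavor of such a bound concrete (the notation below is our own paraphrase, not verbatim from the paper): suppose the analyst computes candidate estimates \phi_1(X), \ldots, \phi_m(X) from the data X, each \sigma-sub-Gaussian about its mean, and reports \phi_T(X), where the selected index T may depend on X. Then the expected bias of the reported value is controlled by the mutual information between the selection and the data:
\[
  \mathbb{E}\bigl[\phi_T(X) - \mu_T\bigr] \;\le\; \sigma \sqrt{2\, I(T; X)},
  \qquad \mu_t := \mathbb{E}[\phi_t(X)].
\]
When the selection uses little information about the data, I(T; X) is small and the reported estimate is nearly unbiased; an aggressive, fully data-driven selection can make I(T; X), and hence the bias, large.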
Through the lens of information usage, we analyze the bias of
specific exploration procedures such as filtering, rank selection,
and clustering. Our general framework also naturally motivates
randomization techniques that provably reduce exploration bias
while preserving the utility of the data analysis. We discuss
the connections between our approach and related ideas from
differential privacy and blinded data analysis, and supplement
our results with illustrative simulations.
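For intuition, here is a minimal toy sketch (hypothetical parameters; not the paper's actual simulations) of both effects at once: rank selection inflates the reported estimate, while randomizing the selection before reporting shrinks that inflation. All true means are zero, so any positive reported value is pure exploration bias.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 50, 20, 2000   # hypothetical: 50 candidate means, 20 samples each
true_means = np.zeros(m)      # every true effect is zero

naive, randomized = [], []
for _ in range(trials):
    # Each sample mean is Gaussian around its true mean with std 1/sqrt(n).
    sample_means = rng.normal(true_means, 1.0 / np.sqrt(n))

    # Rank selection: report the largest sample mean (high information usage).
    naive.append(sample_means.max())

    # Randomized selection: perturb the values before picking an index, then
    # report the *unperturbed* sample mean of that index (lower information usage).
    noisy = sample_means + rng.normal(0.0, 0.5, size=m)  # 0.5 is an illustrative noise scale
    randomized.append(sample_means[noisy.argmax()])

print(f"bias of max selection:        {np.mean(naive):+.3f}")
print(f"bias of randomized selection: {np.mean(randomized):+.3f}")
```

In this sketch the randomization noise makes the selected index nearly independent of the data, so the mutual information between selection and data, and with it the bias, drops, at the cost of sometimes not picking the truly best-looking candidate.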