Abstract: The maximum mean discrepancy (MMD) between two probability measures P
and Q is a metric that, for a characteristic kernel, is zero if and only if the two
measures are equal, making it an appealing statistic for two-sample tests. Given i.i.d. samples
from P and Q, Gretton et al. (2012) show that we can construct an unbiased
estimator for the square of the MMD between the two distributions. If P is a
distribution of interest and Q is the distribution implied by a generative neural
network with stochastic inputs, we can use this estimator to train our neural network.
However, in practice we do not always have i.i.d. samples from our target
distribution. Data sets often exhibit biases, such as under-representation of
certain demographics, and if we ignore this fact our machine learning algorithms
will propagate these biases. Alternatively, it may be useful to assume our data has
been gathered via a biased sample selection mechanism in order to manipulate
properties of the estimating distribution Q.
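To make the training objective concrete, here is a minimal NumPy sketch of the unbiased U-statistic estimator of the squared MMD from Gretton et al. (2012); the Gaussian kernel and its bandwidth are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased U-statistic estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    # Exclude diagonal terms so each within-sample average is unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```

Training a generative network then amounts to drawing Y from the network and taking gradient steps to reduce this quantity, in practice via a differentiable implementation of the same formula.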
In this paper, we construct an estimator for the MMD between P and Q when we
only have access to P via some biased sample selection mechanism, and suggest
methods for estimating this sample selection mechanism when it is not already
known. We show that this estimator can be used to train generative neural networks on a biased data sample, yielding a simulator that reverses the effect of that bias.
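The abstract does not spell out the corrected estimator, so the following is only a plausible sketch: assuming the selection mechanism retains a point x with known probability T(x), the bias can be undone with self-normalised importance weights w_i proportional to 1/T(x_i). The function select_prob and the V-statistic form below are assumptions for illustration, not the paper's definition; the sketch reuses rbf_kernel from above.

```python
import numpy as np

def mmd2_bias_corrected(X_biased, Y, select_prob, bandwidth=1.0):
    """Importance-weighted (V-statistic) estimate of MMD^2(P, Q), where
    X_biased was drawn from P through a mechanism that kept each point x
    with probability select_prob(x). Hypothetical form for illustration."""
    # Self-normalised importance weights undo the selection bias.
    w = 1.0 / np.array([select_prob(x) for x in X_biased])
    w = w / w.sum()
    n = len(Y)
    Kxx = rbf_kernel(X_biased, X_biased, bandwidth)  # kernel from the sketch above
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X_biased, Y, bandwidth)
    term_xx = w @ Kxx @ w                # weighted estimate of E_P[k(x, x')]
    term_yy = Kyy.sum() / (n * n)        # plain estimate of E_Q[k(y, y')]
    term_xy = 2.0 * (w @ Kxy).sum() / n  # weighted cross term
    return term_xx + term_yy - term_xy
```

When the selection probabilities are unknown, the paper suggests methods for estimating them; any such estimate can be plugged in for select_prob.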
TL;DR: We propose an estimator for the maximum mean discrepancy, appropriate when a target distribution is only accessible via a biased sample selection procedure, and show that it can be used in a generative network to correct for this bias.
Keywords: generative networks, two sample tests, bias correction, maximum mean discrepancy