Abstract: We consider the effect of structure-agnostic and structure-dependent masking schemes when training a universal marginaliser to learn conditional distributions of the form $P(x_i \mid x_b)$, where $x_i$ is a given random variable and $x_b$ is an arbitrary subset of the random variables of the generative model of interest. In other words, we mimic the self-supervised training of a denoising autoencoder, where a dataset of unlabelled data is used as partially observed input and the neural approximator is optimised to minimise reconstruction loss. We focus on the process that generates the partially observed data: how well does the neural approximator learn all conditional distributions when the observation process at prediction time differs from the masking process used during training? We compare networks trained with different masking schemes in terms of their predictive performance and generalisation properties.
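As a rough illustration of the training setup described above, the following minimal sketch shows one possible structure-agnostic masking step, in which each variable is hidden independently with a fixed probability and the unmasked sample serves as the reconstruction target. The names `MASK`, `mask_uniform`, `make_training_pair`, and `p_mask` are hypothetical and do not come from the paper. The paper's actual masking schemes, in particular the structure-dependent ones, may differ.

```python
import random

# Hypothetical placeholder marking a variable as unobserved (an assumption,
# not the paper's encoding).
MASK = None


def mask_uniform(sample, p_mask, rng):
    """Structure-agnostic masking: hide each variable independently
    with probability p_mask, ignoring any model structure."""
    return [MASK if rng.random() < p_mask else value for value in sample]


def make_training_pair(sample, p_mask, rng):
    """Build one (partially observed input, full target) pair, as in
    denoising-autoencoder-style reconstruction training."""
    return mask_uniform(sample, p_mask, rng), list(sample)


rng = random.Random(0)
sample = [1, 0, 1, 1, 0]  # one fully observed draw of five binary variables
masked, target = make_training_pair(sample, p_mask=0.5, rng=rng)
```

A network trained on such pairs sees a different random subset $x_b$ for each example. Changing `p_mask`, or replacing `mask_uniform` with a scheme that depends on the model structure, changes the masking process whose mismatch with the prediction-time observation process the abstract asks about.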