TL;DR: Scientific question: How does the brain combine generative models and direct discriminative computations in high-level vision?
Abstract: We ask how the primate brain combines generative models and direct discriminative computations in high-level vision. Both approaches aim to infer behaviorally relevant latent variables y from visual data x. In a probabilistic setting, inferring the posterior p(y|x) is known as discriminative inference, and the two approaches differ in how they implement it. The generative approach employs a model of the joint distribution p(y,x) of the latent variables and the visual input, which captures information about the processes in the world that give rise to the sensory data. Approximate inference algorithms then infer the posterior over the latents given an image by estimating p(y|x) = p(y,x)/p(x). The direct discriminative approach instead learns a direct mapping from the sensory data to the posterior over the latents p(y|x), without an explicit generative model. The generative approach enables unsupervised learning of the structure of the world and promises better generalization to novel situations (statistical efficiency). Direct discriminative computations promise faster inference (computational efficiency) that is accurate for new samples from the distribution experienced in training. In practice, inferring the full posterior may not be realistic, and the visual system may settle for point estimates in certain cases.
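
The contrast between the two routes to p(y|x) can be made concrete with a toy discrete example. The sketch below is illustrative only: the binary latent, the three-valued observation, the probability tables, and all function names are assumptions introduced here, not part of the work described. The generative route computes p(y|x) = p(y,x)/p(x) from an explicit joint model (exactly here, since the toy world is tiny, whereas the abstract refers to approximate inference). The direct route fits a mapping from x to p(y|x) from labeled samples by counting and then answers queries by lookup.

```python
import random

# Hypothetical toy world (all numbers are illustrative assumptions):
# latent y in {0, 1}, observation x in {0, 1, 2}.
PRIOR = [0.7, 0.3]                       # p(y)
LIKELIHOOD = [[0.6, 0.3, 0.1],           # p(x | y=0)
              [0.1, 0.3, 0.6]]           # p(x | y=1)


def generative_posterior(x):
    """Generative route: p(y | x) = p(y, x) / p(x), using the joint model."""
    joint = [PRIOR[y] * LIKELIHOOD[y][x] for y in (0, 1)]   # p(y, x)
    evidence = sum(joint)                                    # p(x)
    return [j / evidence for j in joint]


def sample_world(rng):
    """Draw one (y, x) pair from the joint distribution."""
    y = 1 if rng.random() < PRIOR[1] else 0
    x = rng.choices([0, 1, 2], weights=LIKELIHOOD[y])[0]
    return y, x


def fit_discriminative(n_samples, seed=0):
    """Direct route: estimate p(y | x) from labeled samples by counting.

    No model of p(x) or p(x | y) is kept; the result is a lookup table
    indexed as table[x][y].
    """
    rng = random.Random(seed)
    counts = [[1, 1] for _ in range(3)]   # add-one smoothing, counts[x][y]
    for _ in range(n_samples):
        y, x = sample_world(rng)
        counts[x][y] += 1
    return [[c / sum(row) for c in row] for row in counts]


def point_estimate(posterior):
    """Collapse a posterior to its most probable latent value."""
    return max(range(len(posterior)), key=lambda y: posterior[y])


learned = fit_discriminative(20000)
for x in range(3):
    print(x, generative_posterior(x), learned[x], point_estimate(learned[x]))
```

In this sketch the learned table matches the generative posterior only because test inputs come from the same distribution as the training samples. If the prior over y shifted, the generative route could be updated by changing `PRIOR` alone, while the lookup table would have to be refit. This mirrors the trade-off between statistical and computational efficiency stated above.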