Abstract: Highlights
• We propose a version of CLEVR that is inspired by mental rotation tests.
• Latent feature volumes can be used instead of feature maps for VQA tasks grounded in 3D.
• Contrastive learning can be used to learn an encoder that maps images to latent volumes.
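The highlights do not say which contrastive objective is used, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, and should not be read as the authors' method. It assumes each latent volume has been flattened into a non-zero vector, that `anchors[i]` and `positives[i]` come from two views of the same scene, and that the other rows in the batch serve as negatives. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should match positives[i].

    Every other row of `positives` acts as a negative for anchors[i].
    Inputs are lists of flattened latent volumes (an assumption).
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the target.
        total += log_denom - logits[i]
    return total / len(a)


# Toy "flattened latent volumes" for two views of two scenes (made-up values).
view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
view_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

aligned = info_nce_loss(view_a, view_b)         # matching pairs line up
shuffled = info_nce_loss(view_a, view_b[::-1])  # pairs deliberately mismatched
```

In this toy setup the loss is lower when the paired views line up than when they are swapped, which is the signal an encoder would be trained to exploit.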