Non-maximum Suppression Also Closes the Variational Approximation Gap of Multi-object Variational Autoencoders
Keywords: Object-centric Visual Representation Learning, Deep Generative Models, Computer Vision
Abstract: Learning object-centric scene representations is crucial for structural scene understanding. However, current unsupervised scene factorization and representation learning models do not reason about the relations between scene objects during inference. In this paper, we address this issue by introducing a differentiable correlation prior that forces the inference models to suppress duplicate object representations. The extension is evaluated by adding it to three different scene understanding approaches. The results show that models trained with the proposed method not only outperform the original models in scene factorization, producing fewer duplicate representations, but also close the approximation gap between the data evidence and the evidence lower bound.
One-sentence Summary: Current component VAEs (e.g. MONet, IODINE, MulMON) tend to infer duplicate object representations and thus produce poor scene factorization results; this paper fixes the problem.
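To make the idea of a differentiable correlation prior concrete, below is a minimal sketch of one plausible instantiation: a penalty on pairwise cosine similarity between the per-object latents of a scene, added to the training loss so that inference is discouraged from assigning the same object to multiple slots. This is an illustrative assumption, not the paper's actual prior; the function name `correlation_penalty` and all shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def correlation_penalty(z: torch.Tensor) -> torch.Tensor:
    """Differentiable penalty on pairwise similarity between the K
    per-object latents of a single scene (hypothetical example).

    z: (K, D) tensor of object latent means.
    Returns a scalar: the mean positive off-diagonal cosine
    similarity, which is high when two slots encode the same object.
    """
    z = F.normalize(z, dim=-1)                       # unit-norm each latent
    sim = z @ z.t()                                  # (K, K) cosine similarities
    K = z.shape[0]
    off_diag = sim - torch.eye(K, device=z.device)   # zero out self-similarity
    # Penalize only positive correlations (duplicates), not anti-correlated slots.
    return off_diag.clamp(min=0).sum() / (K * (K - 1))

# Usage: add the penalty to the negative ELBO during training so the
# inference network is pushed away from duplicate object representations.
z = torch.randn(5, 16, requires_grad=True)           # K=5 slots, D=16 latent dims
loss = correlation_penalty(z)
loss.backward()
```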
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=Y7m5I4OlhK