Keywords: Evaluation with VLMs, Evaluation of object-centric models
TL;DR: We provide a holistic evaluation framework for object-centric models that evaluates both ‘what’ and ‘where’ attributes using a single metric.
Abstract: Object-centric learning (OCL) methods were developed by taking inspiration from how humans perceive a scene. It is conjectured that they achieve compositional generalisation by decomposing the scene into objects, making the learned models robust to out-of-distribution (OOD) scenes. However, the recent OCL literature, by and large, evaluates the learned models only on the proxy task of object discovery, which gives no information about which object properties are actually encoded in the object-centric latent representation. Moreover, these models are not evaluated for the broader goals behind object-centric methods, such as compositional generalisation, OOD performance, and counterfactual reasoning. Our work argues that the present evaluation protocols for OCL methods are significantly limited or not scalable. We propose using vision-language models (VLMs) on top of OCL methods to evaluate them on various visual question answering tasks. We are the first to evaluate OCL methods along multiple dimensions, spanning counterfactual, OOD, and compositional reasoning. We also propose a new metric that unifies the evaluation of the ‘what’ and ‘where’ attributes, making the evaluation of OCL methods more holistic than existing metrics. Finally, we complement our analysis with a simple multi-feature reconstruction-based OCL method that outperforms the state of the art across several tasks.
Primary Area: datasets and benchmarks
Submission Number: 18903