Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects

Published: 30 Oct 2023, Last Modified: 30 Nov 2023
SyntheticData4ML 2023 Poster
Keywords: 3d objects, vision language models, semantic annotation, physical properties
TL;DR: We test VLMs on how well they understand 3D objects, using a novel likelihood-based aggregation across 2D views to avoid hallucinations.
Abstract: Unlabeled 3D objects present an opportunity to leverage pretrained vision language models (VLMs) on a range of annotation tasks---from describing object semantics to physical properties. An accurate response must take into account the full appearance of the object in 3D, various ways of phrasing the question/prompt, and changes in other factors that affect the response. We present a method to marginalize over arbitrary factors varied across VLM queries, which relies on the VLM’s scores for sampled responses. We first show that this aggregation method can outperform a language model (e.g., GPT4) at summarization, for instance by avoiding hallucinations when responses contain contrasting details. Second, we show that aggregated annotations are useful for prompt chaining: they help improve downstream VLM predictions (e.g., of object material when the object’s type is given as an auxiliary input in the prompt). Such auxiliary inputs allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show that VLMs approach the quality of human-verified annotations on both type and material inference on the large-scale Objaverse dataset.
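The score-based aggregation the abstract describes can be sketched roughly as follows: score every candidate response under each query variant (rendered 2D view crossed with prompt phrasing), sum the scores in log space to marginalize over those factors, and keep the highest-scoring response. This is a minimal illustrative sketch, not the paper's implementation; `score_fn` and all names here are hypothetical stand-ins for the VLM's log-likelihood of a response.

```python
import math

def aggregate_responses(candidates, views, prompts, score_fn):
    """Return the candidate whose score, summed over all (view, prompt)
    query variants, is highest. `score_fn(response, view, prompt)` is a
    hypothetical stand-in for the VLM's log-likelihood of `response`."""
    best, best_total = None, -math.inf
    for response in candidates:
        total = sum(score_fn(response, v, p) for v in views for p in prompts)
        if total > best_total:
            best, best_total = response, total
    return best

# Toy usage with a fake scorer: "metal" is more likely under every variant,
# so it wins after aggregation across views and prompt phrasings.
fake_scores = {"metal": -1.0, "wood": -2.5}
winner = aggregate_responses(
    candidates=["metal", "wood"],
    views=["front", "back", "top"],
    prompts=["What material is this object?", "Material:"],
    score_fn=lambda r, v, p: fake_scores[r],
)
print(winner)  # prints "metal"
```

Summing log-scores across variants favors responses that are consistently plausible from every view, which is how a single hallucinated detail visible in only one rendering gets voted down.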
Submission Number: 52