3D Question Answering via only 2D Vision-Language Models

FENGYUN WANG; Sicheng Yu; Jiawei Wu; Jinhui Tang; Hanwang Zhang; Qianru Sun

Published: 01 May 2025, Last Modified: 18 Jun 2025. Venue: ICML 2025 poster. License: CC BY 4.0

Abstract: Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential for 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Because 3D training data is limited, we do not train LVLMs but instead run inference in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into a 2D model to answer a given question. Once the 2D model is fixed, e.g., LLAVA-OV, the quality of the sampled views matters most. We propose cdViews, a novel approach that automatically selects critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks.
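
The select-then-deduplicate idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `View` fields, the use of set intersection-over-union as the spatial overlap measure, the `max_overlap` threshold, and the function names are all assumptions made for illustration. The sketch assumes each view already carries a relevance score (standing in for the output of a view selector) and a set of scene point ids it covers, then greedily keeps high-scoring views that do not overlap too much with views already kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    name: str
    score: float        # hypothetical relevance score from a view selector
    visible: frozenset  # hypothetical: ids of scene points seen in this view


def overlap(a: View, b: View) -> float:
    """Intersection-over-union of the point sets covered by two views."""
    union = a.visible | b.visible
    if not union:
        return 0.0
    return len(a.visible & b.visible) / len(union)


def select_views(views, k, max_overlap=0.5):
    """Greedy NMS: keep the highest-scoring views, skipping near-duplicates."""
    kept = []
    for view in sorted(views, key=lambda v: v.score, reverse=True):
        if len(kept) == k:
            break
        # Keep the view only if it is sufficiently different from all kept views.
        if all(overlap(view, other) <= max_overlap for other in kept):
            kept.append(view)
    return kept


views = [
    View("v0", 0.9, frozenset({1, 2, 3, 4})),
    View("v1", 0.8, frozenset({1, 2, 3, 5})),  # overlaps heavily with v0
    View("v2", 0.6, frozenset({7, 8, 9})),
    View("v3", 0.3, frozenset({4, 10, 11})),
]
selected = select_views(views, k=2)
print([v.name for v in selected])  # prints ['v0', 'v2']
```

In this toy input, `v1` scores second highest but shares three of five points with `v0`, so it is suppressed and the next most relevant non-redundant view, `v2`, is kept instead. The selected views would then be passed to the 2D model together with the question.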

Lay Summary: Understanding 3D scenes is crucial in AI, but training 3D models is costly and constrained by limited data. We take a different approach, using only 2D images and a well-pretrained 2D vision-language model in a zero-shot manner. We introduce cdViews, a lightweight view selection strategy that identifies the most critical and diverse views for each question. This lets the model process the views most likely to contain the visual information needed to answer the question, without relying on any 3D-specific training or feature alignment. Our method achieves state-of-the-art performance on two standard 3D question answering benchmarks, demonstrating the effectiveness of 2D models for efficient and scalable 3D scene understanding.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://github.com/fereenwong/cdViews

Primary Area: Applications->Computer Vision

Keywords: 3D Question Answering, 2D Vision-Language Models

Submission Number: 918
