Interactive Cross-modal Learning for Text-3D Scene Retrieval

Yanglin Feng; Yongxiang Li; Yuan Sun; Yang Qin; Dezhong Peng; Peng Hu

Published: 18 Sept 2025, Last Modified: 21 Apr 2026, Venue: NeurIPS 2025 oral, License: CC BY 4.0

Keywords: Cross-modal retrieval, interactive retrieval, large language model

TL;DR: We propose an Interactive Text-to-3D Scene Retrieval Method to handle inherent query limitations.

Abstract: Text-3D Scene Retrieval (T3SR) aims to retrieve relevant 3D scenes using linguistic queries. Although traditional T3SR methods have made significant progress in capturing fine-grained associations, they implicitly assume that query descriptions are information-complete. In practical deployments, however, the limited capabilities of users and models make it difficult or even impossible to directly obtain a textual query that fully suits the entire scene and the model, which degrades retrieval performance. To address this issue, we propose a novel Interactive Text-3D Scene Retrieval Method (IDeal), which improves the alignment between texts and 3D scenes through continuous interaction. To achieve this, we present an Interactive Retrieval Refinement framework (IRR), which employs a questioner to pose contextually relevant questions to an answerer over successive rounds; these questions either promote detailed probing or encourage exploratory divergence within scenes. Given the iterative responses from the answerer, IRR adopts a retriever to perform both feature-level and semantic-level information fusion, facilitating scene-level interaction and understanding for more precise re-ranking. To bridge the domain gap between queries and interactive texts, we propose an Interaction Adaptation Tuning strategy (IAT). IAT mitigates the discriminability and diversity risks among augmented text features that approximate the interaction text domain, achieving contrastive domain adaptation for our retriever. Extensive experimental results on three datasets demonstrate the superiority of IDeal. Code is available at https://github.com/Yangl1nFeng/IDeal.
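
The question, answer, and re-rank loop described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation (see the linked repository for that). Everything here is an assumption made for illustration: the bag-of-words `embed` standing in for learned text and scene encoders, the equal-weight combination of the two fusion scores, the scripted `questioner` and `answerer`, and all function names.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vec = Dict[str, float]


def embed(text: str) -> Vec:
    """Toy L2-normalised bag-of-words vector (hypothetical stand-in for a learned encoder)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def dot(a: Vec, b: Vec) -> float:
    """Sparse dot product between two token-weight vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def mean(vectors: List[Vec]) -> Vec:
    """Element-wise mean of sparse vectors."""
    out: Vec = {}
    for vec in vectors:
        for tok, w in vec.items():
            out[tok] = out.get(tok, 0.0) + w / len(vectors)
    return out


def rank(turns: List[str], scenes: Dict[str, Vec]) -> List[str]:
    """Rank scene ids by a score that mixes two illustrative fusion styles."""
    feature_level = mean([embed(t) for t in turns])  # encode each turn, then fuse features
    semantic_level = embed(" ".join(turns))          # fuse the text, then encode once

    def score(scene_id: str) -> float:
        scene = scenes[scene_id]
        # Equal weighting is an arbitrary choice for this sketch.
        return 0.5 * dot(feature_level, scene) + 0.5 * dot(semantic_level, scene)

    return sorted(scenes, key=score, reverse=True)


def interactive_retrieve(
    query: str,
    scenes: Dict[str, Vec],
    questioner: Callable[[List[str]], str],
    answerer: Callable[[str], str],
    rounds: int = 2,
) -> List[str]:
    """Re-rank scenes after each question/answer round."""
    turns = [query]
    ranking = rank(turns, scenes)
    for _ in range(rounds):
        question = questioner(turns)      # ask something given the dialogue so far
        turns.append(answerer(question))  # the answer adds information about the target
        ranking = rank(turns, scenes)     # re-rank with the enriched text
    return ranking


# Toy data: scenes are represented by text here purely for illustration.
scenes = {
    "scene_a": embed("bedroom with a bed a wooden desk and a blue chair"),
    "scene_b": embed("bedroom with a bed a white wardrobe and a red sofa"),
}
scripted_answers = {
    "What furniture is near the bed?": "a white wardrobe",
    "What colour is the seating?": "a red sofa",
}
questions = list(scripted_answers)

final = interactive_retrieve(
    "a bedroom with a bed",
    scenes,
    questioner=lambda turns: questions[(len(turns) - 1) % len(questions)],
    answerer=lambda q: scripted_answers[q],
)
print(final)
```

In this toy setup the initial query matches both scenes equally well, and the two scripted answers are what separate them. The sketch covers only the interaction loop; the Interaction Adaptation Tuning strategy is not modelled.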

Supplementary Material: zip

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 14862
