SSRB: Direct Natural Language Querying to Massive Heterogeneous Semi-Structured Data

Xin Zhang; Mingxin Li; Yanzhao Zhang; Dingkun Long; Yongqi Li; Yinghui Li; Pengjun Xie; Meishan Zhang; Wenjie Li; Min Zhang; Philip S. Yu

Published: 18 Sept 2025, Last Modified: 30 Oct 2025

Venue: NeurIPS 2025 Datasets and Benchmarks Track poster

License: CC BY-NC 4.0

Keywords: semi-structured data, text retrieval, text embedding

Abstract: Searching over semi-structured data with natural language (NL) queries has attracted sustained attention because it lets broader audiences access information easily. As more applications, such as LLM agents and RAG systems, emerge to search and interact with semi-structured data, two major challenges have become evident: (1) the increasing diversity of domains and schema variations, which makes domain-customized solutions prohibitively costly; and (2) the growing complexity of NL queries, which combine exact field-matching conditions with fuzzy semantic requirements, often involving multiple fields and implicit reasoning. These challenges make formal-language querying and keyword-based search insufficient. In this work, we explore neural retrievers as a unified non-formal querying solution that directly indexes semi-structured collections and understands NL queries. We employ LLM-based automatic evaluation and build a large-scale semi-structured retrieval benchmark (SSRB) using LLM generation and filtering. It contains 14M semi-structured objects from 99 different schemas across 6 domains, along with 8,485 test queries that combine both exact and fuzzy matching conditions. Our systematic evaluation of popular retrievers shows that current state-of-the-art models achieve acceptable performance, yet still lack a precise understanding of matching constraints. In-domain training of dense retrievers, however, significantly improves performance. We believe that SSRB can serve as a valuable resource for future research in this area, and we hope it inspires further exploration of semi-structured retrieval with complex queries.
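To make the idea of directly indexing semi-structured objects and querying them in natural language concrete, here is a minimal sketch. It is not the authors' method or code: the `serialize`, `embed`, `cosine`, and `search` helpers, the JSON serialization format, and the example objects are all hypothetical, and a toy bag-of-words vector stands in for a neural text encoder so that the example runs with the Python standard library alone.

```python
import json
import math
import re
from collections import Counter


def serialize(obj: dict) -> str:
    """Flatten a semi-structured object into one string (assumed format: sorted JSON)."""
    return json.dumps(obj, sort_keys=True, ensure_ascii=False)


def embed(text: str) -> Counter:
    """Toy stand-in for a neural encoder: lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, objects: list[dict], top_k: int = 3) -> list[tuple[float, dict]]:
    """Rank objects of any schema against one NL query, with no per-schema logic."""
    q = embed(query)
    scored = [(cosine(q, embed(serialize(o))), o) for o in objects]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort on score only
    return scored[:top_k]


# Hypothetical objects with three different schemas in one collection.
catalog = [
    {"title": "Trail running shoes", "brand": "Acme", "price": 89, "color": "red"},
    {"name": "Quiet mechanical keyboard", "switch": "linear", "layout": "ISO"},
    {"job": "Data engineer", "location": "Berlin", "remote": True},
]

results = search("red running shoes from Acme", catalog, top_k=1)
```

In this sketch the shoe record ranks first because it shares the most terms with the query. A real dense retriever would replace `embed` with a learned encoder, which is what would let it handle fuzzy semantic conditions that plain term overlap misses.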

Croissant File: json

Dataset URL: https://huggingface.co/datasets/vec-ai/struct-ir-qrels

Code URL: https://github.com/vec-ai/struct-ir

Primary Area: Datasets & Benchmarks illustrating Different Deep learning Scenarios (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 589
