Querying Templatized Document Collections with Large Language Models

Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeighami, Aditya G. Parameswaran, Eugene Wu

Published: 01 Jan 2025, Last Modified: 17 Sept 2025ICDE 2025EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Querying and extracting value from unstructured document collection remains a considerable challenge. While Large Language Models (LLMs) have made remarkable progress in document understanding, they fail to give high accuracy results for analytical queries on documents, and additionally incur high costs. While Retrieval-Augmented Generation (RAG) can reduce costs, accuracy degrades further. Our key insight is that documents in a collection often follow similar templates that impart a common semantic structure. We therefore introduce Zendb, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. Zendb efficiently extracts semantic hierarchical structures from such templatized documents and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Extensive experiments on three real-world document collections demonstrate ZENDB's benefits, achieving up to 31× cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 81% in recall, at a marginally higher cost.