LLM-guided Hierarchical Retrieval

ICLR 2026 Conference Submission 21440 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: information retrieval, llm, ranking, efficient inference
TL;DR: LLMs retrieve from large corpora by navigating a semantic tree of the content.
Abstract: Modern IR systems are increasingly tasked with answering complex, multi-faceted queries that require deep reasoning rather than simple keyword or semantic matching. While LLM-based IR has shown great promise, current paradigms each have limits: retrieve-then-rerank pipelines inherit the weaknesses of embedding-based retrieval; parametric generative approaches are difficult to adapt to new information; and long-context approaches that place the entire corpus in context are computationally infeasible for large document corpora due to the quadratic complexity of attention. To address these limitations, we introduce LATTICE, a hierarchical retrieval framework that enables an LLM to reason over and navigate a large corpus with search complexity logarithmic in the number of documents, achieved by imposing a semantic tree structure on the corpus. Our approach comprises two stages: (1) an offline process that organizes the document collection into a semantic hierarchy, for which we explore two LLM-driven strategies: a bottom-up agglomerative approach and a top-down divisive approach using multi-level summaries; and (2) an online traversal stage in which a "search LLM" navigates this tree. A central challenge in using LLMs for search is that their relevance judgments are *noisy, context-dependent, and unaware of the underlying hierarchy*, making it difficult to compare nodes across different branches and levels of the tree. To solve this, our traversal algorithm estimates calibrated latent relevance scores from the LLM's local outputs and combines them into a path relevance metric that guides the search globally across the tree. Our training-free framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark (with corpora of up to 420K documents), with improvements of up to 9% in Recall@100 and 5% in nDCG@10. Moreover, it achieves results comparable to DIVER-v2, the highly specialized and fine-tuned SOTA method, on the BRIGHT subsets that use a static corpus for evaluation.
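The online stage described in the abstract can be viewed as a best-first search over the semantic tree, ordered by path relevance. Below is a minimal Python sketch of that idea; the `Node` structure, the `llm_score` stub, the LLM-call budget, and the mean aggregation of scores along the path are illustrative assumptions, not the paper's exact calibration or aggregation scheme.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical tree node: internal nodes hold LLM-written summaries of
# their subtree; leaves hold actual documents. Field names are our own.
@dataclass
class Node:
    summary: str
    children: list["Node"] = field(default_factory=list)
    doc_id: Optional[str] = None  # set only on leaf nodes

def llm_score(query: str, summary: str) -> float:
    """Stub for the 'search LLM' relevance judgment in [0, 1].
    The paper estimates calibrated latent scores from the LLM's local
    outputs; any prompted scorer could be plugged in here."""
    raise NotImplementedError

def lattice_traverse(root: Node, query: str, k: int = 100,
                     llm_call_budget: int = 200) -> list[tuple[str, float]]:
    """Best-first traversal guided by a path relevance metric.
    Illustrative assumption: path relevance is the mean of the node
    scores along the root-to-node path."""
    counter = 0  # tiebreaker so the heap never compares Node objects
    # heap entries: (-path_relevance, tiebreak, node, scores along path)
    frontier = [(-1.0, counter, root, [])]
    results: list[tuple[str, float]] = []
    while frontier and llm_call_budget > 0 and len(results) < k:
        neg_rel, _, node, path_scores = heapq.heappop(frontier)
        if node.doc_id is not None:  # leaf: emit a retrieved document
            results.append((node.doc_id, -neg_rel))
            continue
        for child in node.children:  # expand: one LLM call per child
            if llm_call_budget <= 0:
                break
            llm_call_budget -= 1
            score = llm_score(query, child.summary)
            scores = path_scores + [score]
            path_rel = sum(scores) / len(scores)
            counter += 1
            heapq.heappush(frontier, (-path_rel, counter, child, scores))
    return sorted(results, key=lambda r: -r[1])
```

Averaging scores along the path is one simple way to let high-level judgments temper noisy local ones, which is the role the paper assigns to its path relevance metric; the actual calibration of latent scores in LATTICE is more involved than this sketch.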
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21440