Token-Free Hierarchical Indexing for RAG beyond LLM-based Summarization

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, License: CC BY 4.0
TL;DR: SeRAG introduces a token-free hierarchical indexing framework for Retrieval-Augmented Generation that eliminates the need for expensive LLM-based summarization.
Abstract: Retrieval-Augmented Generation (RAG) increasingly relies on hierarchical indexing, yet existing frameworks are bottlenecked by the high cost and information loss of recursive, LLM-based summarization. We propose SeRAG, a novel token-free hierarchical indexing framework that replaces textual summaries with an information-theoretic knowledge taxonomy. SeRAG first transforms a corpus into a multi-perspective graph capturing semantic, logical, and sequential dependencies, then minimizes structural entropy to induce a topologically faithful encoding tree. To bridge the gap between abstract themes and granular facts, we introduce localized structural weight-based vector aggregation for token-free community consolidation. Extensive experiments demonstrate that SeRAG significantly reduces indexing overhead while outperforming state-of-the-art methods in complex multi-hop reasoning tasks.
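To make the structural-entropy objective in the abstract concrete, the following is a minimal sketch of the standard two-dimensional structural entropy of a weighted graph under a flat node partition. It is an illustration under stated assumptions, not the SeRAG implementation: the function name `structural_entropy`, the `(u, v, weight)` edge format, and the restriction to a two-level encoding tree (a single partition) are all choices made here for brevity. A lower value indicates a partition that better matches the graph's community structure.

```python
import math
from collections import defaultdict


def structural_entropy(edges, partition):
    """Two-dimensional structural entropy of an undirected weighted graph.

    edges:     iterable of (u, v, weight) tuples (hypothetical input format)
    partition: list of node collections, each one a community
    """
    # Weighted degree of every node, and total volume (2m) of the graph.
    degree = defaultdict(float)
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
    total_volume = sum(degree.values())

    community_of = {node: idx for idx, block in enumerate(partition) for node in block}

    # Volume of each community and weight of edges crossing its boundary.
    volume = defaultdict(float)
    cut = defaultdict(float)
    for node, d in degree.items():
        volume[community_of[node]] += d
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        if cu != cv:
            cut[cu] += w
            cut[cv] += w

    entropy = 0.0
    # Leaf term: uncertainty of locating a node inside its community.
    for node, d in degree.items():
        entropy -= (d / total_volume) * math.log2(d / volume[community_of[node]])
    # Community term: uncertainty of entering a community from outside.
    for c, vol in volume.items():
        entropy -= (cut[c] / total_volume) * math.log2(vol / total_volume)
    return entropy


# Two triangles joined by one bridge edge (toy data, not from the paper).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
by_triangle = structural_entropy(edges, [[0, 1, 2], [3, 4, 5]])
one_block = structural_entropy(edges, [[0, 1, 2, 3, 4, 5]])
print(by_triangle < one_block)
```

On this toy graph, splitting along the bridge yields lower entropy than treating all six nodes as one community, which is the signal a minimization procedure would follow when building the tree.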
Lay Summary: We propose SeRAG, a new method that organizes large document collections purely mathematically, without needing an AI to write any summaries. SeRAG works by first mapping out how all the pieces of information connect based on their meaning and logic. It then uses information theory to automatically build a structured "tree" of knowledge. Instead of generating text to summarize a group of documents, it calculates a weighted mathematical representation of them.
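The "weighted mathematical representation" mentioned above can be pictured as a weighted average of the member embeddings in a community. The sketch below is only an illustration: the function name `aggregate_community` is hypothetical, and the choice of weights (for example, each chunk's structural weight within its community) is an assumption, since the listing does not specify how SeRAG computes them.

```python
def aggregate_community(embeddings, weights):
    """Weighted mean of member embeddings, standing in for a text summary.

    embeddings: list of equal-length float vectors, one per member chunk
    weights:    non-negative structural weight per member (assumed given)
    """
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    # Normalise by the weight total so the result stays in the embedding scale.
    return [
        sum(w * vec[i] for vec, w in zip(embeddings, weights)) / total
        for i in range(dim)
    ]


# Toy example: the first chunk carries three times the structural weight.
community_vector = aggregate_community([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
print(community_vector)  # [0.75, 0.25]
```

Because no text is generated, building a parent node this way costs only a few arithmetic operations per dimension, which is consistent with the "token-free" framing of the method.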
Link To Code: https://github.com/weiyifan1023/SeRAG
Primary Area: General Machine Learning->Clustering
Keywords: Retrieval-Augmented Generation, Structural Entropy, Clustering
Submission Number: 4950