Empowering Large Language Models to Set Up Knowledge Retrieval Indexing via Self-Learning

Simin Niu, Mengwei Wang, Xun Liang, Zhiyu Li, Sensen Zhang, Shichao Song, Hanyu Wang, Jiawei Yang, Feiyu Xiong, Chenyang Xi

Published: 2026, Last Modified: 06 May 2026. IEEE Trans. Knowl. Data Eng. 2026. License: CC BY-SA 4.0

Abstract: Retrieval-augmented generation (RAG) provides an efficient solution for expanding the knowledge boundaries of large language models (LLMs), in which indexing serves as a compass that guides LLMs to query-relevant external knowledge. Nevertheless, current indexing methods commonly encounter a critical challenge: native indexing is convenient to construct, but it usually disrupts contextual associations and constrains the expressive capacity of rich knowledge. Conversely, knowledge indexing can structure contextual knowledge, but it is often based on preset schemas that limit its generalizability. To address this, we propose a universal and flexible knowledge indexing method called pseudo-graph (PG) indexing. During the index construction phase, we use advanced LLMs to transform the knowledge in each raw text into a concise, structured mind map that organizes intra-document knowledge. Subsequently, the independent mind maps are linked by associating highly relevant topics or consistent facts across documents, thereby establishing inter-document knowledge connections. Using the resulting knowledge network, the PG, as the knowledge index circumvents the challenges of schema design that relies on preset knowledge and relationship types. During the knowledge retrieval phase, we develop a PG knowledge retriever that mimics human note-reviewing, adaptively navigating the PG and recalling query-relevant knowledge from it. Experimental results demonstrate that retrieving relevant pseudo-subgraphs from the PG via PG indexing and the PG retriever significantly improves performance in fact-based Q&A, hallucination correction, and two multi-document Q&A tasks, achieving $F1_{QE}$ improvements of 15.85%, 8.12%, 3.34%, and 5.73%, respectively, and outperforming the state-of-the-art baseline KGP-LLaMA. Our code is available at: https://github.com/IAAR-Shanghai/PGRAG.
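
The two phases described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that): the `MindMap` class, the function names, and the threshold are all hypothetical, the mind maps are written by hand rather than generated by an LLM, and token-overlap (Jaccard) similarity stands in for whatever topic and fact matching the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class MindMap:
    """Hypothetical per-document mind map: one topic plus its facts."""
    doc_id: str
    topic: str
    facts: list = field(default_factory=list)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for semantic matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_pseudo_graph(maps, threshold=0.4):
    """Index construction: link mind maps whose topics are similar enough."""
    edges = {m.doc_id: set() for m in maps}
    for i, a in enumerate(maps):
        for b in maps[i + 1:]:
            if jaccard(a.topic, b.topic) >= threshold:
                edges[a.doc_id].add(b.doc_id)
                edges[b.doc_id].add(a.doc_id)
    return edges


def retrieve(query, maps, edges):
    """Retrieval: seed on the best-matching topic, then follow its links."""
    by_id = {m.doc_id: m for m in maps}
    seed = max(maps, key=lambda m: jaccard(query, m.topic))
    visited = [seed.doc_id] + sorted(edges[seed.doc_id])
    return [fact for doc_id in visited for fact in by_id[doc_id].facts]


# Hand-written toy mind maps (illustrative data, not from the paper).
maps = [
    MindMap("d1", "solar panel efficiency", ["Panel A reaches 22% efficiency."]),
    MindMap("d2", "solar panel cost", ["Panel A costs $200."]),
    MindMap("d3", "wind turbine noise", ["Turbine B emits 45 dB."]),
]
edges = build_pseudo_graph(maps)
facts = retrieve("solar panel efficiency today", maps, edges)
```

In this toy run, `d1` and `d2` are linked because their topics share two of four distinct tokens, so a query about solar panel efficiency also recalls the cost fact from the neighbouring document, while the unrelated `d3` is left out.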

External IDs: dblp:journals/tkde/NiuWLLZSWYXX26