Keywords: Large Language Models; Knowledge Attribution; Interpretability and explainable AI; Citations
TL;DR: We introduce Active Indexing, a training strategy that enables language models to provide reliable internal citations without test-time retrieval, supported by our new CitePretrainBench benchmark.
Abstract: Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable due to hallucinations. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute their answers to the documents seen during (continual) pretraining, without test‑time retrieval, by revising the training process. To study this, we construct **CitePretrainBench**, a benchmark that mixes real‑world corpora (Wikipedia, Common Crawl, arXiv) with novel, unseen documents and probes both short‑form (single fact) and long‑form (multi‑fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; (2) instruction tuning to elicit citation behavior. We introduce **Active Indexing** for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restates each fact in diverse, compositional forms and (ii) enforces bidirectional training (source$\to$fact and fact$\to$source). This equips the model both to generate content from a cited source and to attribute its own answers to their sources, improving robustness to paraphrase and composition. Experiments with Qwen‑2.5‑7B and 3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2\% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16× the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
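For illustration, below is a minimal sketch of how the bidirectional (source$\to$fact and fact$\to$source) augmentation described in the abstract could be constructed. All function names, prompt templates, and identifier formats here are hypothetical assumptions, not details taken from the paper.

```python
# Sketch of Active-Indexing-style data augmentation (hypothetical names and formats;
# the paper's actual paraphrasing pipeline and prompt templates are not specified here).

from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    prompt: str
    target: str


def paraphrase(fact: str) -> List[str]:
    """Placeholder for an LLM-based rewriter that restates a fact in diverse,
    compositional forms (assumed component; identity fallback here)."""
    return [fact]


def build_bidirectional_examples(doc_id: str, facts: List[str]) -> List[Example]:
    """Create source->fact and fact->source training pairs that bind each fact
    to a persistent document identifier."""
    examples: List[Example] = []
    for fact in facts:
        for variant in paraphrase(fact):
            # source -> fact: given the identifier, generate content from the cited source
            examples.append(Example(
                prompt=f"According to document [{doc_id}], state one fact it contains.",
                target=variant,
            ))
            # fact -> source: given the content, attribute it to its source identifier
            examples.append(Example(
                prompt=f"Which document supports the following statement?\n{variant}",
                target=f"[{doc_id}]",
            ))
    return examples


if __name__ == "__main__":
    pairs = build_bidirectional_examples(
        doc_id="wiki_000123",
        facts=["The Eiffel Tower was completed in 1889."],
    )
    for p in pairs:
        print(p.prompt, "=>", p.target)
```

Such pairs would then be mixed into the continual-pretraining corpus, with instruction tuning eliciting the citation behavior afterward, per the two-stage process outlined above.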
Primary Area: interpretability and explainable AI
Submission Number: 12746