Un-Attributability: Computing Novelty from Retrieval & Semantic Similarity

17 Sept 2025 (modified: 04 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: novelty, embedding, similarity, indexing, TDA, attribution
TL;DR: We quantify which LLM outputs cannot be attributed to the training corpus at all, and term such outputs novel; we study the semantic novelty of LLM outputs.
Abstract: Understanding how language‑model outputs relate to the pretraining corpus is central to studying model behavior. Most training‑data attribution (TDA) methods ask which training examples causally influence a given output, often using leave‑one‑out tests. We invert the question: which outputs *cannot* be attributed to any pretraining example? We introduce *un*-attributability as an operational measure of semantic novelty: an output is *novel* if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top‑$n$ candidates, then rerank with ColBERTv2. The less attributable a text is, relative to a human baseline, the more novel it is considered to be. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around *un*-attributability enables efficient analysis at pretraining scale. We release code and $\sim$20 TB of embeddings and index artifacts to support replication and large‑scale extension.
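Below is a minimal, illustrative sketch (not the authors' released code) of the two-stage pipeline described in the abstract: embed and index the corpus with a lightweight GIST-style embedder, retrieve top-$n$ candidates, then rerank. The embedding checkpoint name and the reranking hook are assumptions; the paper's actual index covers roughly 20 TB of artifacts, and the reranking stage uses ColBERTv2.

```python
# Illustrative sketch only: a toy two-stage retrieval pipeline in the spirit
# of the abstract. Checkpoint names and the reranking hook are assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "The mitochondria is the powerhouse of the cell.",
    "Transformers use self-attention to mix information across tokens.",
    "Bread rises because yeast produces carbon dioxide.",
]

# Stage 1: embed the corpus with a lightweight GIST-style embedder and index it.
embedder = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0")  # assumed checkpoint
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(corpus_emb, dtype="float32"))

def retrieve(query: str, n: int = 2):
    """Return the top-n corpus candidates for a model output."""
    q = embedder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), n)
    return [(corpus[i], float(s)) for i, s in zip(ids[0], scores[0])]

# Stage 2 (sketched): rerank candidates with a late-interaction model such as
# ColBERTv2; here only a placeholder hook that keeps the first-stage ordering.
def rerank(query: str, candidates):
    # Substitute a ColBERTv2 reranker here to obtain fine-grained scores.
    return sorted(candidates, key=lambda c: c[1], reverse=True)

output = "Attention layers let every token attend to every other token."
print(rerank(output, retrieve(output, n=2)))
```

Under the paper's framing, an output whose best reranked candidate falls below a similarity threshold calibrated against a human baseline would count as un-attributable, i.e., novel.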
Primary Area: datasets and benchmarks
Submission Number: 8695