Benchmarking and Rethinking Knowledge Editing for Large Language Models

ICLR 2026 Conference Submission 13334 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: benchmark, knowledge editing, LLMs, in-context learning, RAG
TL;DR: A benchmark covering knowledge types (triplet & event), LLM types (instruct & reasoning), inference strategies (autoregressive decoding), edit counts (single & sequential), evaluation (four dimensions & general ability), and a strong RAG baseline.
Abstract: Knowledge editing aims to update the knowledge embedded within large language models (LLMs). However, existing approaches, whether based on parameter modification or external memory integration, are often evaluated with impractical objectives and inconsistent experimental setups. To address this gap, we conduct a comprehensive and practically oriented benchmarking study. In particular, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks, in addition to fact-level datasets. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, and we adopt a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also consider sequential multi-edit scenarios to better capture real-world requirements. We employ four evaluation metrics, with particular emphasis on portability. We compare all recent methods against a simple baseline named Selective Contextual Reasoning (SCR). Empirical results show that parameter-based editing methods perform poorly under realistic conditions, while SCR consistently outperforms them across all settings. Our findings suggest that when knowledge updates are minimal, parameter adjustments can sometimes yield higher reasoning efficiency; in most cases, however, selectively injecting external knowledge into the context is the more robust strategy. Overall, this study delivers a comprehensive evaluation framework for future research and offers fresh perspectives for rethinking knowledge editing methods. The implementation is provided in the Supplementary Materials.
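For intuition, below is a minimal sketch of the general idea behind a selective context-injection baseline as described in the abstract: edited facts live in external memory, and a fact is added to the prompt only when it appears relevant to the query. The dataclass, lexical-overlap scorer, threshold, and function names are illustrative assumptions for this sketch, not the paper's SCR implementation.

```python
# Illustrative sketch of a selective context-injection baseline (not the paper's SCR code).
# Edited facts are stored externally; a fact is injected into the prompt only when it is
# judged relevant to the query, otherwise the model answers from its own parameters.

from dataclasses import dataclass


@dataclass
class EditedFact:
    text: str  # e.g. "The Eiffel Tower is located in Rome."


def token_overlap(a: str, b: str) -> float:
    """Crude lexical relevance score; a real system would likely use a dense retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def build_prompt(query: str, memory: list[EditedFact], threshold: float = 0.2) -> str:
    """Select facts whose relevance exceeds a (hypothetical) threshold and prepend them
    to the query; if none qualify, leave the prompt unchanged."""
    selected = [f.text for f in memory if token_overlap(query, f.text) >= threshold]
    if not selected:
        return query
    context = "\n".join(selected)
    return f"Use the following updated facts if relevant:\n{context}\n\nQuestion: {query}"


# Usage: the returned prompt would then be passed to any instruct or reasoning LLM.
memory = [EditedFact("The Eiffel Tower is located in Rome.")]  # a counterfactual edit
print(build_prompt("Where is the Eiffel Tower located?", memory))
```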
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13334