Keywords: Large Language Models, Machine Learning, Contrastive Decoding, Machine Unlearning, Retrieval-Augmented Generation, Knowledge Graph
Abstract: Large language models (LLMs) trained on web-scale data inevitably encode outdated, private, or undesired knowledge, posing challenges for privacy, safety, and factual reliability. While existing machine unlearning methods typically rely on retraining or fine-tuning, these approaches are costly and risk catastrophic forgetting. In this work, we propose CRED, an in-context unlearning method that enables LLMs to forget specific concepts at inference time without any parameter updates. CRED formulates unlearning as a decoding-time intervention: given a query, it constructs retrieval-augmented prompts from both a retain set and a forget set, then computes a contrastive residual vector from their decoder embeddings. This residual is injected into the decoder of the original prompt, guiding generation away from forget-set content while preserving relevant knowledge. Experiments on the TOFU and MUSE benchmarks demonstrate that CRED achieves effective concept erasure with minimal quality degradation. Additional analyses confirm its stability under 8-bit and 4-bit quantization, highlighting its robustness and practicality for real-world deployment.
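The contrastive-residual intervention described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`contrastive_residual`, `steer`), the scaling factor `alpha`, and the toy NumPy vectors standing in for real decoder hidden states are all assumptions made for clarity.

```python
import numpy as np

def contrastive_residual(h_retain, h_forget):
    # Direction pointing from the forget-set context toward the
    # retain-set context in decoder embedding space (an assumption
    # about how the paper's residual vector is formed).
    return h_retain - h_forget

def steer(h_orig, residual, alpha=1.0):
    # Inject the scaled residual into the original prompt's decoder
    # hidden state before continuing generation; alpha is a
    # hypothetical strength knob.
    return h_orig + alpha * residual

# Toy 4-dimensional hidden states standing in for decoder embeddings
# of the retain-augmented, forget-augmented, and original prompts.
h_retain = np.array([0.9, 0.1, 0.0, 0.2])
h_forget = np.array([0.1, 0.8, 0.3, 0.2])
h_orig = np.array([0.5, 0.5, 0.1, 0.2])

r = contrastive_residual(h_retain, h_forget)
h_steered = steer(h_orig, r, alpha=0.5)
# h_steered is shifted away from the forget direction: [0.9, 0.15, -0.05, 0.2]
```

In a real deployment the hidden states would come from the LLM's decoder for each of the three prompts, and the steered state would replace the original one at every generation step.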
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1926