Abstract: The task of "unlearning" certain concepts in large language models (LLMs) has gained attention for its role in mitigating harmful, private, or incorrect outputs. Current evaluations mostly rely on behavioral tests, without monitoring residual knowledge in model parameters, which can be adversarially exploited to recover erased information. We argue that unlearning should also be assessed internally by tracking changes in the parametric traces of unlearned concepts. To this end, we propose a general evaluation methodology that uses vocabulary projections to inspect concepts encoded in model parameters. We apply this approach to localize "concept vectors" --- parameter vectors encoding concrete concepts --- and construct ConceptVectors, a benchmark of hundreds of such concepts and their parametric traces in two open-source LLMs.
Evaluation on ConceptVectors shows that existing unlearning methods leave concept vectors largely intact and instead suppress the associated knowledge at inference time, whereas directly ablating these vectors removes that knowledge and reduces susceptibility to adversarial attacks.
Our findings reveal limitations of behavior-only evaluations and advocate for parameter-based assessments. We release our code and benchmark at \url{https://anonymous.4open.science/r/ConceptVectors_review-98EF}.
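To make the idea of vocabulary projection and vector ablation concrete, the sketch below illustrates the general technique in its simplest form: reading out a single MLP parameter vector in vocabulary space and zeroing it. This is a minimal illustration only, not the authors' released code; GPT-2, the layer and neuron indices, and the variable names are illustrative assumptions, while the benchmark itself targets other open-source LLMs.

```python
# Minimal sketch of vocabulary projection ("logit lens"-style readout) and
# direct ablation of a candidate concept vector. GPT-2 is used purely for
# illustration; layer/neuron indices are hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer, neuron = 10, 300  # hypothetical location of a candidate "concept vector"

# Each MLP neuron writes one parameter vector into the residual stream
# (a row of the output projection in HF's GPT-2 Conv1D layout).
concept_vector = model.transformer.h[layer].mlp.c_proj.weight[neuron]  # (hidden,)

# Project the vector into vocabulary space via the final LayerNorm and unembedding.
logits = model.transformer.ln_f(concept_vector) @ model.lm_head.weight.T  # (vocab,)
top_tokens = [tokenizer.decode([i]) for i in logits.topk(15).indices.tolist()]
print(top_tokens)  # tokens the vector promotes, hinting at the concept it encodes

# Direct ablation: zero out the parametric trace rather than suppressing it at inference.
with torch.no_grad():
    model.transformer.h[layer].mlp.c_proj.weight[neuron].zero_()
```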
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM Interpretability, LLM Unlearning, LLM Safety, benchmark, evaluations
Contribution Types: Model analysis & interpretability
Languages Studied: English, German
Submission Number: 3977