Abstract: The task of "unlearning" certain concepts in large language models (LLMs) has gained attention for its role in mitigating harmful, private, or incorrect outputs. Current evaluations mostly rely on behavioral tests, without monitoring residual knowledge in model parameters, which can be adversarially exploited to recover erased information. We argue that unlearning should also be assessed internally by tracking changes in the parametric traces of unlearned concepts. To this end, we propose a general evaluation methodology that uses vocabulary projections to inspect concepts encoded in model parameters. We apply this approach to localize "concept vectors" --- parameter vectors encoding concrete concepts --- and construct ConceptVectors, a benchmark of hundreds of such concepts and their parametric traces in two open-source LLMs.
Evaluation on ConceptVectors shows that existing unlearning methods leave concept vectors largely intact and instead suppress the associated knowledge at inference time, whereas directly ablating these vectors removes that knowledge and reduces susceptibility to adversarial attacks.
Our findings reveal limitations of behavior-only evaluations and advocate for parameter-based assessments. We release our code and benchmark at \url{https://anonymous.4open.science/r/ConceptVectors_review-98EF}.
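To make the idea of vocabulary projection and vector ablation concrete, the sketch below illustrates the general technique in its simplest form: reading out a single MLP parameter vector in vocabulary space and zeroing it. This is a minimal illustration only, not the authors' released code; GPT-2, the layer and neuron indices, and the variable names are illustrative assumptions, while the benchmark itself targets other open-source LLMs.

```python
# Minimal sketch of vocabulary projection ("logit lens"-style readout) and
# direct ablation of a candidate concept vector. GPT-2 is used purely for
# illustration; layer/neuron indices are hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer, neuron = 10, 300  # hypothetical location of a candidate "concept vector"

# Each MLP neuron writes one parameter vector into the residual stream
# (a row of the output projection in HF's GPT-2 Conv1D layout).
concept_vector = model.transformer.h[layer].mlp.c_proj.weight[neuron]  # (hidden,)

# Project the vector into vocabulary space via the final LayerNorm and unembedding.
logits = model.transformer.ln_f(concept_vector) @ model.lm_head.weight.T  # (vocab,)
top_tokens = [tokenizer.decode([i]) for i in logits.topk(15).indices.tolist()]
print(top_tokens)  # tokens the vector promotes, hinting at the concept it encodes

# Direct ablation: zero out the parametric trace rather than suppressing it at inference.
with torch.no_grad():
    model.transformer.h[layer].mlp.c_proj.weight[neuron].zero_()
```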
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM Interpretability, LLM Unlearning, LLM Safety, benchmark, evaluations
Contribution Types: Model analysis & interpretability
Languages Studied: English, German
Submission Number: 3977