Keywords: machine unlearning, representational erasure, linear probes, membership inference, nearest-neighbor purity, Grad-CAM, CKA, CNNs, Vision Transformers, CNR
TL;DR: Beyond output masking: CNR erases forgotten-class features so that linear and nonlinear decoders and membership inference attacks fall to near chance, with minimal retain-set loss.
Abstract: Current machine unlearning methods reduce predictions for forgotten classes but often leave their internal representations intact, achieving avoidance rather than erasure. We define true unlearning as the elimination of class-specific information from hidden states such that no simple or robust decoder can recover it. We introduce CNR (Class-Specific Neuronal Reset), an architecture-agnostic procedure with three steps: (1) identify class-selective units via mean activation screening, (2) apply targeted resets by fine-tuning on GAN-generated synthetic samples derived from the forget classes to suppress activation of forget-specific pathways, and (3) perform retain-only fine-tuning with regularization to restore global function. Across MNIST, CIFAR-10/100, LFW, and CUB-200-2011 on CNNs and ViTs, prior approaches (gradient ascent, KD-based unlearning, logit masking, retain-only fine-tuning) suppress forget-class accuracy yet still permit decoding above chance from hidden states. CNR drives linear probes, k-NN and SVM decoders, and membership inference attacks to chance performance, while reducing nearest-neighbor label purity to the class prior. It achieves this with minimal retained-class degradation ($\leq 5\%$ drop) and preserved CKA similarity. Grad-CAM and layer-wise analyses confirm targeted class-selective erasure rather than global damage.
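To make step (1) and the nearest-neighbor purity metric concrete, here is a minimal NumPy sketch on synthetic activations. It is an illustration under stated assumptions, not the authors' implementation: the function names `class_selective_units` and `nn_label_purity`, the mean-gap selection criterion, the choice of `k`, and the synthetic data are all hypothetical, and zeroing the selected units is only a crude stand-in for the fine-tuning-based reset described in step (2).

```python
import numpy as np

def class_selective_units(acts, labels, forget_class, top_k=2):
    """Rank units by mean activation on the forget class minus the mean on
    all other classes (an assumed screening criterion), most selective first."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    forget_mask = labels == forget_class
    gap = acts[forget_mask].mean(axis=0) - acts[~forget_mask].mean(axis=0)
    return np.argsort(-gap)[:top_k]

def nn_label_purity(feats, labels, query_class, k=3):
    """Average fraction of the k nearest neighbours (self excluded) that share
    the label, computed over samples of query_class."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances; a sample is never its own neighbour.
    dists = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    same_label = labels[neighbours] == labels[:, None]
    return same_label[labels == query_class].mean()

# Synthetic hidden states: 200 samples, 8 units, 4 balanced classes.
rng = np.random.default_rng(0)
labels = np.arange(200) % 4
acts = rng.normal(size=(200, 8))
forget_class = 2
# Assumption for the demo: units 1 and 5 respond strongly to the forget class.
acts[np.ix_(labels == forget_class, [1, 5])] += 3.0

selected = class_selective_units(acts, labels, forget_class, top_k=2)
purity_before = nn_label_purity(acts, labels, forget_class)

# Crude stand-in for the reset: silence the selected units entirely.
reset_acts = acts.copy()
reset_acts[:, selected] = 0.0
purity_after = nn_label_purity(reset_acts, labels, forget_class)

print("selected units:", sorted(selected.tolist()))
print("purity before reset:", purity_before)
print("purity after reset:", purity_after)
```

In this toy setting the only forget-class signal lives in the two planted units, so removing them should pull purity toward the class prior of roughly 0.25. In a real network the class information is typically distributed across many units and layers, which is presumably why the abstract pairs the reset with retain-only fine-tuning and layer-wise analysis.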
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23273