Keywords: open-weight security, undistillable models, knowledge distillation, white-box threat model, logit-rank scrambling, model exfiltration defense, large language models, label smoothing
TL;DR: We introduce Teacher Scrambling, a novel open-weight distillation defense that preserves top-k utility while preventing information gain from the teacher's logit rank distribution.
Abstract: Open-weight security requires that post-release foundation models be resistant to misuse. Even if a model is made unmodifiable, an attacker may distill it into a new model that they can modify. Previous work has examined preventing distillation of closed-access models. We analyze undistillability under the constraint that an attacker has access to unmodifiable language model weights, and we introduce Teacher Scrambling, a novel method that preserves task utility for the original model while preventing information gain from the logit rank distribution via a logit rank scrambling loss. We show that attempting to distill student models from a scrambled teacher results in worse performance than training with label smoothing, thereby defeating the purpose of the attempted distillation.
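To make the mechanism concrete, below is a minimal PyTorch sketch of one way a rank-scrambling term could be combined with a task loss: the task cross-entropy preserves top-k utility, while an illustrative KL-to-uniform penalty flattens the non-top-k (tail) distribution so its rank order carries little distillable signal. This is not the paper's actual objective; the function name `scramble_aware_loss`, the cutoff `k`, the KL-to-uniform penalty, and the weighting are assumptions for illustration.

```python
# Illustrative sketch only: the paper's exact scrambling objective is not
# reproduced here. The KL-to-uniform tail penalty, the top-k cutoff, and
# the weighting below are assumptions for demonstration.
import torch
import torch.nn.functional as F


def scramble_aware_loss(logits, targets, k=5, scramble_weight=1.0):
    """Task cross-entropy plus a term that flattens the non-top-k (tail)
    logit distribution, so its rank order carries little distillable signal."""
    task_loss = F.cross_entropy(logits, targets)

    # Identify the tail: every vocabulary position outside the top-k.
    topk_idx = logits.topk(k, dim=-1).indices
    tail_mask = torch.ones_like(logits, dtype=torch.bool)
    tail_mask.scatter_(-1, topk_idx, False)

    # Distribution restricted to the tail (top-k positions masked out).
    tail_logits = logits.masked_fill(~tail_mask, float("-inf"))
    tail_log_probs = F.log_softmax(tail_logits, dim=-1)

    # Penalize divergence from a uniform tail distribution.
    n_tail = logits.size(-1) - k
    uniform = torch.full_like(logits, 1.0 / n_tail).masked_fill(~tail_mask, 0.0)
    scramble_loss = F.kl_div(
        tail_log_probs.masked_fill(~tail_mask, 0.0), uniform, reduction="batchmean"
    )

    return task_loss + scramble_weight * scramble_loss


if __name__ == "__main__":
    logits = torch.randn(4, 32000)           # (batch, vocab) teacher logits
    targets = torch.randint(0, 32000, (4,))  # gold next-token ids
    print(scramble_aware_loss(logits, targets))
```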
Submission Number: 51