Keywords: open-weight security, undistillable models, knowledge distillation, white-box threat model, logit-rank scrambling, model exfiltration defense, large language models, label smoothing
TL;DR: We introduce Teacher Scrambling, a novel open-weight distillation defense that preserves top-k utility while preventing information gain from the teacher's logit rank distribution.
Abstract: Open-weight security requires that post-release foundation models be resistant to misuse. Even if a model is made unmodifiable, an attacker may distill it into a new model that they can modify. Previous work has examined preventing distillation of closed-access models. We analyze undistillability under the constraint that an attacker has access to unmodifiable language model weights, and we introduce Teacher Scrambling, a novel method that preserves task utility for the original model while preventing information gain from the logit rank distribution via a logit rank scrambling loss. We show that attempting to distill student models from a scrambled teacher results in worse performance than training with label smoothing, thereby defeating the purpose of the attempted distillation.
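To make the mechanism concrete, below is a minimal PyTorch sketch of one way a rank-scrambling term could be combined with a task loss: the task cross-entropy preserves top-k utility, while an illustrative KL-to-uniform penalty flattens the non-top-k (tail) distribution so its rank order carries little distillable signal. This is not the paper's actual objective; the function name `scramble_aware_loss`, the cutoff `k`, the KL-to-uniform penalty, and the weighting are assumptions for illustration.

```python
# Illustrative sketch only: the paper's exact scrambling objective is not
# reproduced here. The KL-to-uniform tail penalty, the top-k cutoff, and
# the weighting below are assumptions for demonstration.
import torch
import torch.nn.functional as F


def scramble_aware_loss(logits, targets, k=5, scramble_weight=1.0):
    """Task cross-entropy plus a term that flattens the non-top-k (tail)
    logit distribution, so its rank order carries little distillable signal."""
    task_loss = F.cross_entropy(logits, targets)

    # Identify the tail: every vocabulary position outside the top-k.
    topk_idx = logits.topk(k, dim=-1).indices
    tail_mask = torch.ones_like(logits, dtype=torch.bool)
    tail_mask.scatter_(-1, topk_idx, False)

    # Distribution restricted to the tail (top-k positions masked out).
    tail_logits = logits.masked_fill(~tail_mask, float("-inf"))
    tail_log_probs = F.log_softmax(tail_logits, dim=-1)

    # Penalize divergence from a uniform tail distribution.
    n_tail = logits.size(-1) - k
    uniform = torch.full_like(logits, 1.0 / n_tail).masked_fill(~tail_mask, 0.0)
    scramble_loss = F.kl_div(
        tail_log_probs.masked_fill(~tail_mask, 0.0), uniform, reduction="batchmean"
    )

    return task_loss + scramble_weight * scramble_loss


if __name__ == "__main__":
    logits = torch.randn(4, 32000)           # (batch, vocab) teacher logits
    targets = torch.randint(0, 32000, (4,))  # gold next-token ids
    print(scramble_aware_loss(logits, targets))
```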
Submission Number: 51