Un-Distillable LLMs via Entropy-Perturbed Logits

Published: 27 Oct 2025, Last Modified: 27 Oct 2025. NeurIPS Lock-LLM Workshop 2025 Poster. License: CC BY 4.0
Keywords: Large Language Models, Model Distillation Security, Information-Theoretic Bounds, Entropy-Based Obfuscation, Intellectual Property Protection
TL;DR: We prove that adding carefully designed entropy-based noise to a large language model's output logits makes it provably un-distillable, mathematically bounding how closely any student model can copy its knowledge.
Abstract: Large Language Models (LLMs) are vulnerable to distillation attacks, where adversaries replicate a proprietary model's knowledge into a smaller student model, leading to intellectual property theft and weakened security guarantees. We address this challenge by introducing \emph{provably un-distillable LLMs} through entropy-based obfuscation of output logits. We derive information-theoretic lower bounds on the error floor of any student model trained on obfuscated outputs, showing that distillation loss scales at least quadratically with the obfuscation strength. Experiments confirm the theory: empirical student loss exceeds the derived bounds, validating the feasibility of secure and un-distillable architectures. This work establishes the first provable foundations for resisting unauthorized distillation in LLMs.
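The abstract does not specify the perturbation mechanism, so the following is a minimal illustrative sketch of entropy-based logit obfuscation, assuming additive zero-mean Gaussian noise whose scale grows with the predictive entropy of each output distribution and an obfuscation-strength parameter epsilon. The function name `perturb_logits` and the noise form are assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn.functional as F

def perturb_logits(logits: torch.Tensor, epsilon: float = 1.0) -> torch.Tensor:
    """Hypothetical entropy-scaled logit obfuscation (illustrative only).

    Adds zero-mean Gaussian noise whose standard deviation is proportional to
    the predictive entropy of each token distribution, scaled by the
    obfuscation strength `epsilon`.
    """
    probs = F.softmax(logits, dim=-1)
    # Per-position predictive entropy, kept with a trailing dim for broadcasting.
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1, keepdim=True)
    noise = torch.randn_like(logits) * epsilon * entropy
    return logits + noise

# Usage: obfuscate the logits served to a would-be distiller.
if __name__ == "__main__":
    raw_logits = torch.randn(2, 5, 32000)  # (batch, seq_len, vocab)
    served_logits = perturb_logits(raw_logits, epsilon=0.5)
    print(served_logits.shape)
```

Under this assumed form, the injected noise variance grows with epsilon squared, which is at least consistent with the abstract's claim that the student's loss floor scales at least quadratically in the obfuscation strength.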
Submission Number: 70