Antidistillation Sampling

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: LLM, distillation, security
TL;DR: Antidistillation sampling strategically modifies a model's next-token probability distribution to poison reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility.
Abstract: Frontier models that generate extended reasoning traces inadvertently produce token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. *Antidistillation sampling* provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's utility.
Supplementary Material: gz
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 16696
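
The abstract describes modifying the next-token probability distribution but does not say how the modification is computed. The toy sketch below only illustrates the general shape of such a sampler, under stated assumptions: the per-token `penalty` vector, the strength `lam`, and all function names are hypothetical and not taken from the paper. Here `penalty` stands in for any score that is higher for tokens estimated to be less useful to a distilled student, and `lam` controls how far sampling departs from the unmodified distribution.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adjusted_distribution(teacher_logits, penalty, lam=1.0, temperature=1.0):
    """Shift the teacher's logits by lam * penalty, then normalize.

    ASSUMPTION: `penalty` is a hypothetical per-token score; the abstract
    does not define it. With lam = 0 this reduces to ordinary
    temperature sampling from the teacher.
    """
    if len(teacher_logits) != len(penalty):
        raise ValueError("teacher_logits and penalty must have the same length")
    shifted = [l / temperature + lam * p for l, p in zip(teacher_logits, penalty)]
    return softmax(shifted)


def sample_token(probs, rng):
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy vocabulary of four tokens; all numbers are made up for illustration.
teacher_logits = [2.0, 1.0, 0.5, -1.0]
penalty = [-0.5, 0.8, 0.0, 0.0]

base = adjusted_distribution(teacher_logits, penalty, lam=0.0)
poisoned = adjusted_distribution(teacher_logits, penalty, lam=1.0)
token = sample_token(poisoned, random.Random(0))
```

In this sketch, a small `lam` keeps the sampled traces close to the teacher's own distribution, which is one way to picture the trade-off between preserving utility and reducing the value of the traces for distillation.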