Distilling Safe LLM Systems via Soft Prompts

Published: 01 Jul 2025, Last Modified: 07 Jul 2025, ICML 2025 R2-FM Workshop Poster, CC BY 4.0
Keywords: LLM Safety, Soft Prompts
TL;DR: We distill safe LLM systems into learnt soft prompts.
Abstract: Large Language Models (LLMs) have enabled machine learning to be integrated into complex tasks across various domains. This is cause for concern because LLMs may respond to carefully crafted prompts with unsafe content, necessitating concrete safety mechanisms. Current solutions pair LLMs with guard models in dual-model systems. However, the substantial memory and computational demands of guard models pose significant challenges for deployment. This paper proposes an efficient method for approximating the functionality of dual-model systems using learned embeddings, also known as soft prompts. We introduce a novel distillation framework that minimizes the total variation distance between the outputs of an LLM with a guard and the same LLM enhanced with our soft prompts. At test time, the learned soft prompts are prepended to user prompts, providing safety at a fraction of the cost incurred by a guard model. Evaluations on various benchmarks demonstrate improved safety, offering an efficient alternative to guard models for hardware-constrained applications.
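
The sketch below illustrates the general shape of the approach described in the abstract: a small set of trainable prompt embeddings is prepended to the frozen LLM's input, and the soft prompts are trained to minimize the total variation distance between the model's next-token distribution and that of a guarded teacher system. The base model name, prefix length, optimizer settings, and the placeholder teacher distribution are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of soft-prompt distillation against total variation distance.
# Assumes Hugging Face Transformers; all hyperparameters are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base LLM; the paper's models may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # base LLM stays frozen; only the soft prompts are learned

embed = model.get_input_embeddings()
num_soft_tokens = 16  # assumed prefix length
soft_prompt = nn.Parameter(torch.randn(num_soft_tokens, embed.embedding_dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def next_token_probs_with_soft_prompt(input_ids):
    """Next-token distribution of the frozen LLM with soft prompts prepended."""
    token_embeds = embed(input_ids)                              # (B, T, D)
    prefix = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits           # (B, num_soft_tokens + T, V)
    return logits[:, -1].softmax(dim=-1)

def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * (p - q).abs().sum(dim=-1).mean()

# One hypothetical training step. `teacher_probs` stands in for the next-token
# distribution of the dual-model system (LLM + guard) on the same prompt; how that
# distribution is produced depends on the guard model and is not shown here.
input_ids = tokenizer("How do I pick a lock?", return_tensors="pt").input_ids
with torch.no_grad():
    teacher_probs = model(input_ids=input_ids).logits[:, -1].softmax(dim=-1)  # placeholder teacher

student_probs = next_token_probs_with_soft_prompt(input_ids)
loss = tv_distance(student_probs, teacher_probs)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

At inference time, only the learned `soft_prompt` tensor needs to be stored and prepended to each user prompt's embeddings, which is where the memory and compute savings over running a separate guard model would come from.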
Submission Number: 104