Keywords: LLM Safety
TL;DR: We distill safe LLM systems into learned embeddings (soft prompts) for better memory and computational efficiency.
Abstract: Large Language Models (LLMs) have enabled machine learning to be integrated into complex tasks across various domains. This raises safety concerns, since LLMs may respond to carefully crafted prompts with unsafe content, necessitating concrete safety mechanisms. Current solutions involve dual-model systems combining LLMs with guard models. However, the substantial memory and computational demands of guard models pose significant challenges for deployment. This paper proposes an efficient method for approximating the behavior of dual-model systems using learned embeddings, also known as soft prompts. We introduce a novel distillation framework which optimizes the total variation distance between the outputs of an LLM paired with a guard and the same LLM equipped with our soft prompts. At test time, the learned soft prompts are prepended to user prompts, providing safety at a fraction of the memory and compute costs incurred by a guard model. Our evaluations on various benchmarks demonstrate improved safety of the LLM, offering an efficient alternative to guard models for memory- and computation-constrained settings such as hardware applications.
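To make the described objective concrete, here is a minimal sketch of the two ingredients named in the abstract: trainable soft-prompt embeddings prepended to the user prompt, and a total variation distance loss between the guarded teacher's output distribution and the soft-prompted student's. All names (SoftPrompt, tv_distance_loss, tensor shapes) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; assumes a standard PyTorch LLM whose input
# embeddings can be manipulated directly. Not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPrompt(nn.Module):
    """Trainable embeddings prepended to the user prompt's token embeddings."""
    def __init__(self, num_virtual_tokens: int, embed_dim: int):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (batch, seq_len, embed_dim)
        batch = prompt_embeds.size(0)
        soft = self.embeddings.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([soft, prompt_embeds], dim=1)

def tv_distance_loss(student_logits: torch.Tensor, teacher_probs: torch.Tensor) -> torch.Tensor:
    """Total variation distance between next-token distributions of the
    soft-prompted student and the LLM+guard teacher, averaged over batch
    and sequence positions: 0.5 * sum_v |p(v) - q(v)|."""
    student_probs = F.softmax(student_logits, dim=-1)
    return 0.5 * (student_probs - teacher_probs).abs().sum(dim=-1).mean()
```

In this sketch only the soft-prompt parameters would be optimized, with the base LLM frozen, so that at test time safety comes from prepending a handful of learned embeddings rather than running a separate guard model.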
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4379