Keywords: System Prompt extraaction, large language models, evaluation framework
Abstract: The system prompt in Large Language Models (LLMs) plays a pivotal role in guiding model behavior and response generation. Often containing private configuration details, user roles, and operational instructions, the system prompt has become an emerging attack target. Recent studies have shown that LLM system prompts are highly susceptible to extraction attacks through meticulously designed queries, raising significant privacy and security concerns. Despite the growing threat, there is a lack of systematic studies of system prompt extraction attacks and defenses. In this paper, we present a comprehensive framework, SPE-LLM, to systematically evaluate System Prompt Extraction attacks and defenses in LLMs, where we propose several adversarial query design techniques, defense mechanisms, and compare them with the state-of-the-art (SOTA) baselines. First, we design a set of novel adversarial queries that effectively extract system prompts from the SOTA LLMs, demonstrating the severe risks of LLM system prompt extraction. Second, we propose several defense techniques to mitigate the attacks, providing practical solutions for secure LLM deployments. Third, we used a diverse set of evaluation metrics to accurately quantify the severity of system prompt extraction attacks in LLMs and conduct comprehensive experiments across multiple benchmark datasets, which validate the efficacy of our proposed SPE-LLM framework.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14461
Loading