Keywords: guardrails, safety, robustness, alignment
TL;DR: We propose a technique that uses prompt engineering to build task-specific safety guardrails for LLMs.
Abstract: Most real-world deployments of large language models (LLMs) operate within well-scoped tasks, yet current safety measures are general-purpose and fail to leverage this information. As a result, even in narrowly-scoped tasks, LLM applications remain vulnerable to adversarial jailbreaks. In these settings, we argue that task-specific safety guardrails solve a more tractable problem than general-purpose methods. We introduce Inverse Prompt Engineering (IPE) as an initial approach to building automatic, task-specific safety guardrails around LLMs. Our key insight is that robust safety guardrails can be derived from prompt engineering data that is already on hand. IPE operationalizes the principle of least privilege from computer security, restricting LLM functionality to only what is necessary for the task. We evaluate our approach in two settings. First, in an example chatbot application, IPE outperforms existing methods against both human-written and automated adversarial attacks. Second, on TensorTrust, a crowdsourced dataset of prompt-based attacks and defenses, IPE improves average defense robustness by 93% using real-world prompt engineering data.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13469