Keywords: guardrails, safety, robustness, alignment
TL;DR: We propose a technique that uses prompt engineering to build task-specific safety guardrails for LLMs.
Abstract: Most real-world deployments of large language models (LLMs) operate within well-scoped tasks, yet current safety measures are general-purpose and fail to leverage this information. As a result, even in narrowly-scoped tasks, LLM applications remain vulnerable to adversarial jailbreaks. In these settings, we argue that task-specific safety guardrails solve a more tractable problem than general-purpose methods. We introduce Inverse Prompt Engineering (IPE) as an initial approach to building automatic, task-specific safety guardrails around LLMs. Our key insight is that robust safety guardrails can be derived from prompt engineering data that is already on hand. IPE operationalizes the principle of least privilege from computer security, restricting LLM functionality to only what is necessary for the task. We evaluate our approach in two settings. First, in an example chatbot application, IPE outperforms existing methods against both human-written and automated adversarial attacks. Second, on TensorTrust, a crowdsourced dataset of prompt-based attacks and defenses, IPE improves average defense robustness by 93% using real-world prompt engineering data.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13469