Keywords: instruction following, prompt injection, robustness, rule following, synthetic data, fine-tuning
TL;DR: We collect a new dataset for improving system message robustness and show supervised fine-tuning on this data significantly improves performance on robustness benchmarks.
Abstract: System messages have emerged as critical control surface for specifying the behavior of LLMs in chat applications. Developers frequently rely on the precedence of the system message over user messages, and use it to specify important guardrails, content policies, and safety countermeasures to the model. However in practice, models may fail to fully adhere to the system message, as the result of adversarial attacks such as prompt injection or simply through unforced errors when responding to benign queries. In this work we assemble a suite of benchmarks to quantify an LLM's system message robustness. We then collect a novel fine-tuning dataset starting from a diverse set of system prompts from real-world LLM applications, generating challenging synthetic user messages both benign and adversarial, and collecting high-quality model responses. Our experiments show that fine-tuning on our dataset yields considerable gains on a variety of benchmarks, compared to both the starting model as well as fine-tuning on other similarly sized datasets targeted at improving system message compliance.
Submission Number: 78
Loading