Keywords: Red-teaming, AI Safety, Safety Training, LLM Defense, Safety Training Data, Adversarial Training, Adversarial Attacks, Jailbreak
Abstract: We introduce WILDTEAMING, an automatic red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes selections of multiple tactics for systematic exploration of novel and challenging jailbreaks. WILDTEAMING reveals previously unidentified vulnerabilities of frontier LLMs, resulting in more diverse and successful adversarial attacks compared to state-of-the-art jailbreaking methods.
With WILDTEAMING we create WILDJAILBREAK, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WILDJAILBREAK provides two contrastive types of queries: 1) harmful queries and 2) benign queries that resemble harmful queries in form but carry no harmful intent. Through extensive model training and evaluation, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of both vanilla and adversarial queries, and minimal, if any, decrease in general capabilities.
Submission Number: 95