Keywords: jailbreak templates, red teaming, LLMs, multi-turn to single-turn compression, evolutionary algorithms, prompt engineering, automated discovery, X-Teaming, LLM-as-judge, StrongREJECT
TL;DR: We present X-Teaming Evolutionary M2S, an automated LLM-guided framework that evolves multi-turn jailbreak attacks into optimized single-turn templates, reaching 44.8% success on GPT-4.1 at a calibrated threshold of θ=0.70, with cross-model transfer that varies by target.
Abstract: Multi-turn-to-single-turn (M2S) compresses iterative red teaming into one structured prompt, but prior work relied on a few hand-crafted formats. We present **X‑Teaming Evolutionary M2S**, an automated framework that *discovers* and *optimizes* M2S templates via LLM‑guided evolution, with smart sampling (12 sources), a StrongREJECT‑style *LLM‑as‑judge*, and auditable logs.
To restore selection pressure, we calibrate the success threshold to θ=0.70. On GPT‑4.1 the search runs for *five* generations, discovers *two* new template families, and reaches **44.8%** overall success (103/230). A balanced cross‑model panel (2,500 trials; judge fixed) shows that structural gains transfer but vary by target; two models score zero at θ=0.70. We also observe a positive length–score coupling, motivating length‑aware judging.
Our results establish structure‑level search as a reproducible path to stronger single‑turn probes and highlight threshold calibration and cross‑model evaluation as key to progress. Code, configs, and artifacts: [https://github.com/hyunjun1121/M2S-x-teaming](https://github.com/hyunjun1121/M2S-x-teaming).
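As a minimal sketch of how a thresholded success rate such as the one reported above can be computed, the snippet below counts judge scores at or above θ. The function name `success_rate`, the inclusive comparison (`>=`), and the example scores are assumptions for illustration; they are not taken from the released code.

```python
from typing import Sequence


def success_rate(scores: Sequence[float], theta: float = 0.70) -> float:
    """Return the fraction of judge scores at or above the threshold theta.

    Assumption: a trial counts as a success when score >= theta.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    successes = sum(1 for s in scores if s >= theta)
    return successes / len(scores)


# Hypothetical judge scores, for illustration only.
example_scores = [0.9, 0.7, 0.69, 0.2]
strict = success_rate(example_scores, theta=0.70)   # 2 of 4 trials pass
lenient = success_rate(example_scores, theta=0.25)  # 3 of 4 trials pass

# The reported GPT-4.1 figure follows directly from the stated counts.
reported = 103 / 230
print(f"{reported:.1%}")  # 44.8%
```

With the hypothetical scores, moving the threshold from 0.25 to 0.70 lowers the measured rate from 0.75 to 0.5, which illustrates why the choice of θ affects how much room remains for templates to be distinguished from one another.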
Submission Number: 36