Final-turn-only Replay as Context Ablation Evaluation for Multi-Turn Automated Red Teaming

ACL ARR 2026 January Submission 9189 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multi-turn Automated Red Teaming, jailbreak, attack success rate, LLM-based Red Teaming, Human Red Teaming
Abstract: Red teaming evaluates Large Language Model (LLM) safety from an adversarial perspective, and recent work has scaled this into multi-turn automated attacks intended to surface vulnerabilities that arise through iterative, context-dependent interaction. However, multi-turn \emph{format} can be conflated with multi-turn \emph{essentiality}: some ``multi-turn'' jailbreaks may be largely single-turn reducible, succeeding from the final attacker request alone. We propose a simple context-ablation protocol, \emph{final-turn-only replay evaluation}, to measure this reducibility. For each attack transcript, we replay only the attacker's final user-facing turn to the same target LLM as a new single-turn input under identical system prompts and decoding settings. We report the attack success rate over the full conversation, the attack success rate under final-turn-only replay, and the difference between the two as an operational proxy for multi-turn dependency. We evaluate the protocol across both \textbf{human and automated} multi-turn red teaming. We first collect human red-teaming dialogues in a workshop and measure their reducibility. We then implement an existing LLM-based automated multi-turn red-teaming pipeline and design a toolbox with multiple variants, including additional variants in which a crescendo-style, stepwise escalation strategy is incorporated as a tool. Experimental results show that the best-performing configuration in the full-conversation setting does not necessarily achieve the best attack success rate under final-turn-only replay, while most of the human red-teaming attacks were inherently multi-turn. These findings suggest that reporting should not be limited to full-conversation performance; the attack success rate under final-turn-only replay, and the difference between the two attack success rates, should be reported as well.
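The replay protocol described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `query_target` and `judge_success` are hypothetical stand-ins for the target-LLM call (assumed to fix the system prompt and decoding parameters internally) and the attack-success judge.

```python
def attack_success_rate(transcripts, query_target, judge_success,
                        final_turn_only=False):
    """Fraction of attack transcripts judged successful.

    transcripts: list of conversations, each a list of attacker user turns.
    If final_turn_only is True, only the attacker's final user-facing turn
    is replayed to the target as a fresh single-turn input (the context
    ablation); otherwise the full conversation is sent.
    """
    successes = 0
    for turns in transcripts:
        prompt = [turns[-1]] if final_turn_only else turns
        response = query_target(prompt)
        if judge_success(response):
            successes += 1
    return successes / len(transcripts)


def multi_turn_dependency(transcripts, query_target, judge_success):
    """Return (ASR_full, ASR_final_only, difference).

    The difference serves as the operational proxy for how much the
    attacks actually depend on multi-turn context.
    """
    asr_full = attack_success_rate(transcripts, query_target, judge_success)
    asr_final = attack_success_rate(transcripts, query_target, judge_success,
                                    final_turn_only=True)
    return asr_full, asr_final, asr_full - asr_final
```

Under this sketch, an attack that succeeds in the full conversation but fails under final-turn-only replay contributes positively to the difference, i.e., it is counted as genuinely multi-turn rather than single-turn reducible.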
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: red teaming, safety and alignment, evaluation and metrics
Contribution Types: Model analysis & interpretability
Languages Studied: Japanese
Submission Number: 9189