Keywords: Adversarial Prompting, Large Language Models, Dialogue Modeling, Multi-Turn Interaction, Safety Evaluation
Abstract: Adversarial red teaming is a central component of large language model (LLM) safety evaluation. While prior work has cataloged attack types and measured aggregate failure rates, less attention has been paid to the structured decision-making behavior of human attackers in multi-turn interaction. In this work, we model adversarial dialogue as a hierarchical and sequential process. We introduce a structured representation that decomposes red teaming conversations into goals, strategies, and tactics, where strategies capture distinct vulnerability dimensions and tactics operationalize these strategies at the linguistic level. Using 38,961 multi-turn conversations from a large-scale red teaming dataset, we analyze both first-turn strategy effects and multi-turn adaptation dynamics. Causal estimation reveals systematic differences in success rates across strategic categories. Predictive modeling further shows that incorporating structured strategy, tactic, and adaptation features raises AUC from 0.719, for a baseline without structure, to 0.746. Our findings suggest that adversarial effectiveness is not uniform but varies across structured vulnerability dimensions, and that modeling red teaming as sequential strategic interaction provides measurable explanatory and predictive gains.
Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.
Primary Area Selection: Interaction and Dialogue
Use Of Generative Artificial Intelligence Tools: Yes, for editing/proofreading the manuscript, Yes, for writing code
Data Collection From Human Subjects: No
Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.
Submission Number: 243