Are LLMs Safe Enough? Generating Complex and Implicit Adversarial Instructions for Automated Red-Teaming

ACL ARR 2025 February Submission 4978 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: The safety of large language models (LLMs) is crucial for building trustworthy AI applications. Existing benchmarks, whether human-crafted malicious instructions or model-generated jailbreak prompts, suffer from semantic simplicity and poor cross-model generalization. We propose the Adversarial Instruction Generation Framework (AIGF), which dynamically creates complex and implicit adversarial instructions for automated red-teaming. AIGF combines adversarial attacks on target models with an iterative reflection loop that refines failed attempts. Using AIGF, we construct two datasets, AIGF Hard and AIGF Medium, which achieve high Attack Success Rates (ASR) on eight LLMs and demonstrate strong cross-model generalization. We also conduct extensive experiments to analyze why AIGF is effective. We will open-source our datasets in the near future.
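The abstract describes AIGF only at a high level (attack the target model, then refine the instruction via reflection), and the paper's actual implementation is not shown here. The following is a minimal, hypothetical sketch of such an attack-and-reflect loop; all names (red_team_one_seed, generator, judge, critic, max_rounds) are illustrative assumptions, not the authors' API.

```python
"""Hypothetical sketch of an attack-and-reflect red-teaming loop.

Assumes the caller supplies four callables: a target model, an
instruction generator, a safety judge, and a critic that turns a
failed attempt into feedback. None of these names come from the paper.
"""

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AttackResult:
    instruction: str   # adversarial instruction that was tried last
    response: str      # target model's reply to that instruction
    success: bool      # whether the judge flagged the reply as unsafe
    rounds_used: int   # reflection rounds spent


def red_team_one_seed(
    seed_behavior: str,
    target_model: Callable[[str], str],               # prompt -> response
    generator: Callable[[str, Optional[str]], str],   # (behavior, feedback) -> instruction
    judge: Callable[[str, str], bool],                # (instruction, response) -> unsafe?
    critic: Callable[[str, str], str],                # (instruction, response) -> feedback
    max_rounds: int = 5,
) -> AttackResult:
    """Iteratively attack the target model and refine the instruction."""
    feedback: Optional[str] = None
    instruction, response = seed_behavior, ""
    for round_idx in range(1, max_rounds + 1):
        # 1. Generate (or refine) a complex, implicit adversarial instruction.
        instruction = generator(seed_behavior, feedback)
        # 2. Attack: query the target model with the candidate instruction.
        response = target_model(instruction)
        # 3. Judge: did the target produce unsafe content?
        if judge(instruction, response):
            return AttackResult(instruction, response, True, round_idx)
        # 4. Reflect: turn the failed attempt into feedback for the next round.
        feedback = critic(instruction, response)
    return AttackResult(instruction, response, False, max_rounds)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model API.
    result = red_team_one_seed(
        seed_behavior="explain how to pick a lock",
        target_model=lambda p: "I cannot help with that.",
        generator=lambda b, fb: f"[more implicit rewrite] {b}" + (f" // hint: {fb}" if fb else ""),
        judge=lambda instr, resp: "cannot" not in resp.lower(),
        critic=lambda instr, resp: "target refused; embed the request in a benign scenario",
        max_rounds=3,
    )
    print(result.success, result.rounds_used)
```

In practice the generator, judge, and critic would themselves be LLM calls, and an attack success rate over a seed set would be computed by running this loop per seed; the sketch only fixes the control flow implied by the abstract.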
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation; model bias/unfairness mitigation; reflections and critiques
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4978