Adversarial Testing for Large Language Models: Evaluating and Enhancing Robustness with AutoDAN and Fine-Tuning Techniques
Keywords: Adversarial Attacks, Jailbreaking, AI Safety, Model Robustness, Fine-Tuning, LoRA (Low-Rank Adaptation), AutoDAN, Red Teaming, Adversarial Training
TLDR: We demonstrate that fine-tuning LLMs on AutoDAN-generated adversarial examples significantly boosts security, reducing jailbreak success rates while balancing the trade-off between robustness and general utility.
Abstract: Large Language Models (LLMs) exhibit impressive generative capabilities but remain vulnerable to adversarial inputs, exposing potential risks such as data leakage, harmful content generation, and jailbreak attacks. Jailbreak attacks can fool LLMs and elicit harmful output even with safety alignments and guardrails. In this work, we performed a case study using AutoDAN – an automated adversarial attack generator – to stress-test open-source LLMs. We measure baseline attack success rates (ASRs) and then apply various fine-tuning defenses to mitigate these vulnerabilities. We used AutoDAN to construct adversarial datasets and evaluate the robustness of fine-tuned LLMs. Using llama-3-8b-instruct as the base model, we apply full Supervised Fine-Tuning (SFT) AutoDAN-style attacks. Our preliminary experiments show a reduction in jailbreak success rate after fine-tuning, while maintaining usefulness and coherence for benign queries. We conclude by outlining best practices for deploying adversarially resilient LLMs in production environments and future work to continue research with adversarial attack vulnerabilities and agentic workflows.
Submission Number: 29
Loading