BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

Published: 18 Sept 2025, Last Modified: 30 Oct 2025NeurIPS 2025 Datasets and Benchmarks Track posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Large Language Models, Backdoor Attacks, Backdoor Defenses, AI safety
TL;DR: In this work, we introduce \textit{BackdoorLLM}, the first comprehensive benchmark for studying backdoor attacks and defenses on LLMs.
Abstract: Generative large language models (LLMs) have achieved state-of-the-art results on a wide range of tasks, yet they remain susceptible to backdoor attacks: carefully crafted triggers in the input can manipulate the model to produce adversary-specified outputs. While prior research has predominantly focused on backdoor risks in vision and classification settings, the vulnerability of LLMs in open-ended text generation remains underexplored. To fill this gap, we introduce \textit{BackdoorLLM}\footnote{Our BackdoorLLM benchmark was awarded First Prize in the \href{https://www.mlsafety.org/safebench/winners}{SafetyBench competition} organized by the \href{https://safe.ai/}{Center for AI Safety}.}, the first comprehensive benchmark for systematically evaluating backdoor threats in text-generation LLMs. BackdoorLLM provides: (i) a unified repository of benchmarks with a standardized training and evaluation pipeline; (ii) a diverse suite of attack modalities, including data poisoning, weight poisoning, hidden-state manipulation, and chain-of-thought hijacking; (iii) over 200 experiments spanning 8 distinct attack strategies, 7 real-world scenarios, and 6 model architectures; (iv) key insights into the factors that govern backdoor effectiveness and failure modes in LLMs; and (v) a defense toolkit encompassing 7 representative mitigation techniques. Our code and datasets are available at \url{https://github.com/bboylyg/BackdoorLLM}. We will continuously incorporate emerging attack and defense methodologies to support the research in advancing the safety and reliability of LLMs.
Croissant File: json
Dataset URL: https://huggingface.co/BackdoorLLM
Code URL: https://github.com/bboylyg/BackdoorLLM
Primary Area: Machine learning approaches to data and benchmarks enrichment, augmentation and processing (supervised, unsupervised, online, active, fine-tuning, RLHF, SFT, alignment, etc.)
Submission Number: 203
Loading