CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 Poster, CC BY 4.0
TL;DR: This paper introduces CROW, a method that eliminates backdoors in LLMs via internal consistency regularization during finetuning, using adversarial perturbations and minimal clean data to reduce attack success rates without compromising performance.
Abstract: Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. Existing defense methods—designed for vision/text classification tasks—fail for text generation. We propose *Internal Consistency Regularization (CROW)*, a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered, while clean models show smooth transitions. CROW enforces consistency across layers via adversarial perturbations and regularization during finetuning, neutralizing backdoors without requiring clean reference models or trigger knowledge—only a small clean dataset. Experiments across Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B demonstrate CROW’s effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance. CROW’s architecture-agnostic design enables practical deployment.
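The abstract describes the core mechanism: a layer-wise consistency penalty, computed under adversarial perturbations of the input embeddings, added to the finetuning objective. The PyTorch sketch below illustrates one way such an objective could be wired up against a Hugging Face causal LM. It is a minimal sketch, not the authors' implementation (see the linked repository for that); the cosine-based layer distance, the single FGSM-style perturbation step, and the values of `epsilon` and `alpha` are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): a layer-wise consistency
# penalty for a Hugging Face causal LM, maximized by an FGSM-style embedding
# perturbation and then minimized jointly with the task loss.
import torch
import torch.nn.functional as F


def consistency_loss(hidden_states):
    """Penalize large layer-to-layer changes in hidden representations.

    hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, dim],
    as returned by a transformer called with output_hidden_states=True.
    """
    loss = 0.0
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(h_prev, h_next, dim=-1)  # [batch, seq]
        loss = loss + (1.0 - cos).mean()
    return loss / (len(hidden_states) - 1)


def crow_style_step(model, input_ids, labels, epsilon=0.1, alpha=1.0):
    """One illustrative training step: perturb the input embeddings so as to
    maximize layer-wise inconsistency, then minimize task loss plus the
    consistency penalty under that perturbation. epsilon and alpha are
    placeholder hyperparameters."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)

    # Inner step: gradient of the consistency loss w.r.t. the embeddings.
    out = model(inputs_embeds=embeds, output_hidden_states=True)
    adv_loss = consistency_loss(out.hidden_states)
    grad, = torch.autograd.grad(adv_loss, embeds)
    delta = epsilon * grad.sign()  # FGSM-style perturbation

    # Outer step: combined objective on the perturbed embeddings.
    out_adv = model(inputs_embeds=embeds.detach() + delta,
                    labels=labels, output_hidden_states=True)
    return out_adv.loss + alpha * consistency_loss(out_adv.hidden_states)
```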
Lay Summary: How can we strip hidden “trigger phrases” out of a large language model without knowing what the trigger is or retraining the model from scratch? That was the driving question. Backdoor attacks let an adversary poison a model during training so that a secret phrase (for example *BadMagic*) makes it spout toxic text or insert malicious code while behaving normally the rest of the time. Existing defenses either rely on an untouched reference model, often unavailable in practice, or blunt the model’s usefulness by heavy pruning or full retraining. The authors set out to discover whether a single, lightweight tune-up could erase such backdoors.

Their answer is CROW, a minimalist “internal consistency” finetune. In a healthy transformer, hidden activations flow smoothly from one layer to the next; a backdoor trigger disrupts that smoothness. CROW first magnifies any potential disruption by adding small adversarial nudges to the input embeddings, then penalizes large layer-to-layer jumps during a brief LoRA finetune on just one hundred clean prompts. This short procedure, finished in minutes on a single GPU, leans on a simple loss term that encourages every layer to behave almost isometrically, starving hidden triggers of the amplification they need.

The result is a practical detox recipe: after one pass of CROW, a wide range of poisoned models behave as if the trigger were never planted, while their helpfulness on everyday tasks stays intact. Because the method needs no knowledge of the trigger, no separate clean model, and only a handful of clean examples, it turns the once daunting task of “backdoor removal” into something a small team or an open-source community can do in an afternoon. By enforcing this internal consistency, CROW points the way to safer, more trustworthy LLM deployments in customer service, coding assistance, and other high-stakes domains.
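The lay summary's "brief LoRA finetune on a small clean set" could plug into the sketch above roughly as follows. Everything here is a placeholder rather than the paper's settings: the checkpoint path, the LoRA configuration, the learning rate, and `clean_dataloader` (assumed to yield batches with `input_ids` and `labels` drawn from a small clean prompt set).

```python
# Hypothetical outer loop (placeholders throughout): attach LoRA adapters to a
# possibly backdoored checkpoint and run a brief finetune on a small clean set,
# reusing crow_style_step() from the sketch above.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/possibly-backdoored-llm")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

model.train()
for batch in clean_dataloader:  # placeholder loader over ~100 clean prompts
    loss = crow_style_step(model, batch["input_ids"], batch["labels"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```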
Link To Code: https://github.com/NayMyatMin/CROW
Primary Area: Deep Learning->Large Language Models
Keywords: LLM Security, Backdoor Defense, Consistency Regularization
Submission Number: 11400