BadJudge: Backdoor Vulnerabilities of LLM-As-A-Judge

ICLR 2025 Conference Submission12474 Authors

27 Sept 2024 (modified: 28 Nov 2024)ICLR 2025 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLM-as-a-Judge, LLM Evaluator, Backdoor Attack, Backdoor Defense
TL;DR: Judges can be backdoored, model merge can fix this. Rerankers and Guardrails are vulnerable too.
Abstract: This paper exposes the backdoor threat in automatic evaluation with LLM-as-a-Judge. We propose a novel threat model, where the adversary assumes control of both the candidate model and evaluator model. The victims are the benign users who are unfairly rated by the backdoored evaluator in favor of the adversary. A trivial single token backdoor poisoning 1% of the evaluator training data triples the score of the adversary, compared with the adversary's legitimate score. We systematically categorize levels of data access corresponding to three real-world settings, (1) web poisoning, (2) malicious annotator, and (3) weight poisoning. These levels reflect a weak to strong escalation of access that highly correlates with attack severity. The (1) web poisoning setting contains the weakest assumptions, but we still inflate scores by over 20%. The (3) weight poisoning setting holds the strongest assumptions, allowing the adversary to inflate their scores from 1.5/5 to 4.9/5. The backdoor threat generalizes to different evaluator architectures, trigger designs, evaluation tasks, and poisoning rates. By poisoning 10% of the evaluator training data, we control toxicity judges (Guardrails) to misclassify toxic prompts as non-toxic 89% of the time and document reranker judges in RAG to rank the poisoned document first 97% of the time. Defending the LLM-as-a-Judge setting is challenging because false positive errors are unethical. This limits the available tools we can use for defense. Fortunately, we find that model merging, a simple and principled knowledge transfer tool used to imbue LLM Judges with multi-task evaluation abilities, works surprisingly well. The backdoor is mitigated to near 0% whilst maintaining SOTA performance. Its low computational cost, convenient integration into the current LLM Judge training pipeline, and SOTA performance position it as a promising avenue for backdoor mitigation in the LLM-as-a-Judge setting.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12474
Loading