CAP: Improving the Robustness of LLM-as-a-Judge Against Adversarial Score Manipulation via Comparative Augmented Prompting
Keywords: LLM-as-a-Judge, Adversarial Robustness, Adversarial Attacks, CAP
TL;DR: We propose CAP, a method that integrates comparative evaluation into absolute scoring to significantly improve the robustness of LLM-as-a-Judge systems against adversarial attacks.
Abstract: Large language models (LLMs) have been widely adopted for automated evaluation tasks, demonstrating human-aligned assessment capabilities. However, studies reveal that LLM-as-a-Judge systems are significantly vulnerable to adversarial attacks, where carefully crafted suffixes or prompts can easily manipulate scoring outcomes and cause abnormal score inflation, severely undermining evaluation reliability. Through systematic experimentation, we find that comparative evaluation tasks are more robust against such attacks than absolute scoring tasks. Building on this insight, we propose CAP (Comparative Augmented Prompting), a method that enhances robustness by integrating a structured comparative mechanism into the absolute scoring framework. Experimental results on two benchmark datasets, SummEval and TopicalChat, show that CAP maintains scoring accuracy while significantly reducing the success rate of adversarial attacks, offering a promising solution for building more secure and reliable LLM-as-a-Judge systems.
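The abstract describes folding comparative evaluation into an absolute scoring prompt; the Python sketch below illustrates one plausible way such a comparative-augmented judge prompt could be assembled. It is a minimal illustration based only on the abstract's description, not the authors' implementation: the function name `build_cap_prompt`, the two-step instruction wording, and the use of trusted reference responses are all assumptions.

```python
# Hypothetical sketch of a comparative-augmented judge prompt, assuming CAP asks the
# judge to compare the candidate against trusted references before giving an absolute
# score. Names and prompt wording are illustrative, not taken from the paper.

def build_cap_prompt(source: str, candidate: str, references: list[str],
                     criterion: str = "coherence", scale: str = "1-5") -> str:
    """Build a judge prompt that requires pairwise comparisons against
    reference responses before committing to an absolute score."""
    ref_block = "\n\n".join(
        f"Reference {i + 1}:\n{ref}" for i, ref in enumerate(references)
    )
    return (
        f"You are evaluating the {criterion} of a response on a {scale} scale.\n\n"
        f"Source:\n{source}\n\n"
        f"Candidate response:\n{candidate}\n\n"
        f"{ref_block}\n\n"
        "Step 1: For each reference, state whether the candidate is better, "
        "comparable, or worse, with a one-sentence justification.\n"
        "Step 2: Only after completing the comparisons, output the final "
        f"absolute {criterion} score ({scale}) for the candidate."
    )


# Usage: the resulting string would be sent to the judge LLM in place of a
# plain absolute-scoring prompt.
prompt = build_cap_prompt(
    source="<article to be summarized>",
    candidate="<candidate summary, possibly carrying an adversarial suffix>",
    references=["<trusted reference summary A>", "<trusted reference summary B>"],
)
print(prompt)
```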
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17609