Abstract: LLM-as-a-Judge uses a large language model (LLM) to select the
best response from a set of candidates for a given question. LLM-as-a-Judge has many applications, such as LLM-powered search,
reinforcement learning with AI feedback (RLAIF), and tool selection.
In this work, we propose JudgeDeceiver, an optimization-based
prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects
a carefully crafted sequence into an attacker-controlled candidate
response such that LLM-as-a-Judge selects the candidate response
for an attacker-chosen question no matter what other candidate
responses are. Specifically, we formulate finding such a sequence
as an optimization problem and propose a gradient-based method
to approximately solve it. Our extensive evaluation shows that
JudgeDeceiver is highly effective, and is much more effective than
existing prompt injection attacks that manually craft the injected
sequences, as well as jailbreak attacks extended to our problem. We
also show the effectiveness of JudgeDeceiver in three case studies,
i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we
consider defenses including known-answer detection, perplexity
detection, and perplexity windowed detection. Our results show
these defenses are insufficient, highlighting the urgent need for
developing new defense strategies. Our implementation is available
at this repository: https://github.com/ShiJiawenwen/JudgeDeceiver.