Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

ACL ARR 2025 July Submission 838 Authors

28 Jul 2025 (modified: 26 Aug 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: This paper investigates defenses in LLM-based evaluation, where prompt injection attacks can manipulate scores by deceiving the evaluation system. We formalize blind attacks as a class in which candidate answers are crafted independently of the true answer. To counter such attacks, we propose an evaluation framework that combines standard and counterfactual evaluation. Experiments show it significantly improves attack detection with minimal performance trade-offs for recent models.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, Prompt Injection Attacks, Robust QA Evaluation, Adversarial Evaluation, LLM-as-a-Judge
Contribution Types: NLP engineering experiment
Languages Studied: English
Previous URL: https://openreview.net/forum?id=cZHx5KYXk5
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 4.1 Experimental setup, Ethics Statement
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Ethics Statement
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Ethics Statement
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: Ethics Statement
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 4.1 Experimental Setup
B6 Statistics For Data: Yes
B6 Elaboration: 4.1 Experimental Setup
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4.1 Experimental Setup
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4.1 Experimental Setup
C3 Descriptive Statistics: Yes
C3 Elaboration: 4.2 Results
C4 Parameters For Packages: Yes
C4 Elaboration: 4.1 Experimental Setup
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Ethics Statement
Author Submission Checklist: yes
Submission Number: 838