Evaluating LLM Judges in Cybersecurity Script Analysis
Keywords: evaluation, benchmarking, cybersecurity, code analysis, LLM-as-a-judge
TL;DR: We address the gap in evaluating LLM-as-a-judge systems for cybersecurity by creating a specialized benchmark and identifying which models best align with expert judgment when evaluating script behavioral summaries.
Abstract: Building on the increasing use of Large Language Models (LLMs) as judges for evaluating natural language outputs, this paper examines which judge models produce evaluations that rank highest under expert review of cybersecurity script analyses. Our newly constructed dataset of 1,000+ clean and malicious scripts, paired with expert-curated natural language summaries, serves as a reference for evaluating the behavioral script summarization task performed by a candidate LLM. Several judge LLMs are asked to evaluate the candidate LLM's responses using the human-written summaries as reference. Through manual assessment of the judges' evaluations, we identify the models whose outputs experts rate highest in cybersecurity contexts and analyze the factors influencing judge quality, including self-preference bias and the effects of prompting strategy. Our publicly released dataset supports continued research in this domain, where accurate evaluation is increasingly vital.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 208