SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection

Mariam ALMutairi

Published: 14 May 2026, Last Modified: 14 May 2026 | AIWare 2026 Benchmark and Dataset | Everyone | CC BY 4.0

Keywords: security benchmark, mutation testing, large language models, vulnerability detection, code security, CWE, LLM evaluation, software testing, security testing

TL;DR: SecMutBench shows that traditional mutation scores inflate LLM security testing performance by 2.2×. It introduces the Security Mutation Score, which filters out crash-based kills, and benchmarks 8 LLMs on 339 Python programs spanning 30 CWEs.

Abstract: Existing LLM security benchmarks evaluate code generation quality, leaving an open question: can LLMs generate tests that detect vulnerabilities? We address this with two technical contributions. First, we propose the Security Mutation Score (SMS), a metric that classifies mutant kills into semantic, functional, incidental, and crash categories using operator-aware heuristics, distinguishing genuine security awareness from coincidental detection. We further define Effective SMS (EffSMS = SMS × Secure-Pass Rate) to account for test validity. Second, we design 25 security-specific mutation operators spanning 30 CWE categories that transform secure Python code into realistic vulnerable variants, extending prior security mutation frameworks to Python and introducing 22 new operators. Evaluating eight LLMs and two static analysis baselines on 339 programs and 1,869 mutants reveals three findings: (i) traditional mutation scores overstate LLM security testing capability by 2.2× on average; (ii) the best LLM achieves only 19.7% EffSMS vs. 47.6% for expert-written tests, a 2.4× gap that raw metrics obscure; and (iii) functional kills, not crashes, dominate non-semantic failures (15–36%), showing that LLMs detect behavioral side effects rather than security properties. Static analysis and mutation testing provide complementary coverage across syntactic vs. logic-flaw CWEs. Code and data are publicly available.
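As a rough illustration of how the metrics in the abstract relate to one another, the sketch below computes a traditional mutation score, SMS, and EffSMS from per-mutant kill labels. Only the relation EffSMS = SMS × Secure-Pass Rate and the four category names come from the abstract. The function name `mutation_scores`, the `"survived"` label, the choice to credit only semantic kills toward SMS, and the reading of Secure-Pass Rate as the fraction of generated tests that pass on the secure code are all assumptions made for this example, not the benchmark's actual implementation.

```python
from collections import Counter

# The four kill categories named in the abstract. Any other label
# (for example "survived") is treated as a mutant that no test killed.
KILL_CATEGORIES = ("semantic", "functional", "incidental", "crash")


def mutation_scores(outcomes, secure_pass_rate, counted=("semantic",)):
    """Compute traditional score, SMS, and EffSMS from per-mutant labels.

    outcomes: one label per mutant.
    secure_pass_rate: value in [0, 1]; ASSUMED here to be the fraction of
        generated tests that pass on the original secure code.
    counted: kill categories credited toward SMS. ASSUMPTION: semantic
        kills only; the abstract does not state the exact formula.
    """
    if not outcomes:
        raise ValueError("need at least one mutant")
    if not 0.0 <= secure_pass_rate <= 1.0:
        raise ValueError("secure_pass_rate must be in [0, 1]")
    counts = Counter(outcomes)
    total = len(outcomes)
    # Traditional score: any kill counts, regardless of why the test failed.
    traditional = sum(counts[c] for c in KILL_CATEGORIES) / total
    # SMS: only kills in the credited categories count.
    sms = sum(counts[c] for c in counted) / total
    # EffSMS as defined in the abstract: SMS scaled by test validity.
    eff_sms = sms * secure_pass_rate
    return {"traditional": traditional, "sms": sms, "eff_sms": eff_sms}


# Made-up labels for ten mutants, purely for illustration.
example = (
    ["semantic"] * 3
    + ["functional"] * 2
    + ["incidental", "crash"]
    + ["survived"] * 3
)
scores = mutation_scores(example, secure_pass_rate=0.8)
print(scores)
```

With these invented labels, the traditional score credits seven of ten mutants while SMS credits three, which is the kind of inflation the abstract describes. Passing a different `counted` tuple shows how the gap changes under other readings of SMS.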

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 7
