Script Kiddie Uplift: Measuring Procedural Misuse Amplification in AI Agents
Keywords: AI agents, Cybersecurity, Script kiddies
TL;DR: Our study demonstrates that step-by-step prompting bypasses AI agents’ safety controls, generating code that significantly improves novice attackers’ capabilities and lowers the barrier to cybercrime.
Abstract: We study diffuse cybersecurity risk and how code generated by mid-2025 large language models (LLMs) uplifts novice attackers.
We focus on script kiddies—low-skill opportunistic attackers who stand to gain the most from AI-assisted coding—and examine procedural misuse amplification: whether models assist harmful goals when requests are decomposed into benign substeps.
We develop a benchmark of eight cyber-offensive tasks spanning five MITRE ATT\&CK tactics, designed to be low-resource, novice-accessible, and reflective of exploits a script kiddie might realistically attempt. Each task is decomposed into substeps that experts validated for feasibility of completion. Evaluating 10 models across four families, we find three key results. First, step-by-step interactions rarely trigger refusal, even when the same models recognize safety concerns in a separate evaluation. Second, most models produce functionally complete code, with Gemini 2.5 Pro, GPT-5, and GPT-5 Mini providing the most operationally ready outputs. Third, in a human study with 38 novice participants and 101 task attempts across eight scenarios, code-assisted participants progressed nearly one additional step on average ($\hat{\beta} = 0.88$, 95\% CI: $[0.16, 1.71]$), and the code group completed more steps than an internet-only baseline in seven of eight tasks. A complementary meta-analysis of per-task effect sizes is directionally consistent but not statistically significant ($\widehat{\text{CLES}} = 0.59$, 95\% CI: $[0.46, 0.72]$), likely reflecting limited power from eight tasks. Among participants who completed tasks in both settings, weaker performers benefited disproportionately: those below the 60th percentile closed on average 47\% of the gap to top performers.
These findings demonstrate that multi-step decomposition circumvents current safety measures and that LLMs measurably lower barriers for precisely the population least capable without them.
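For readers unfamiliar with the CLES statistic reported above, the following is a minimal sketch of the standard common language effect size (probability of superiority): the chance that a randomly drawn value from one group exceeds a randomly drawn value from the other, with ties counted as one half. The abstract does not specify the estimator or pooling procedure used in the study, so the function name `cles`, the tie handling, and the step counts below are illustrative assumptions, not the authors' method or data.

```python
from itertools import product


def cles(treatment, control):
    """Common language effect size (probability of superiority).

    Returns the fraction of (treatment, control) pairs in which the
    treatment value is larger, counting each tie as half a win.
    """
    pairs = list(product(treatment, control))
    wins = sum(1.0 for t, c in pairs if t > c)
    ties = sum(0.5 for t, c in pairs if t == c)
    return (wins + ties) / len(pairs)


# Hypothetical steps-completed counts for a single task (NOT study data).
code_group = [3, 4, 2, 5]
internet_only = [2, 3, 1, 4]

value = cles(code_group, internet_only)
print(f"CLES = {value:.3f}")
```

Under this definition a value of 0.5 indicates no difference between groups, which is why a confidence interval that includes 0.5 is read as not statistically significant.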
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 65