Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Security, Data Poisoning, Automated Auditing, Malicious Code Generation, Scam Phishing Detection
TL;DR: Scam2Prompt automatically audits LLMs by synthesizing developer-style prompts that trigger them to generate malicious code containing scam URLs. A curated subset of these prompts reveals high malicious code generation rates across seven SOTA LLMs.
Abstract: Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. This threat is not merely theoretical, as demonstrated by a real-world incident documented in this paper in which developers lost thousands of dollars executing LLM-generated code containing scam API endpoints. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that 4.24% of code snippets generated from these prompts contained malicious URLs. Our framework also unexpectedly discovered 62 active scam sites missed by existing scam databases, now confirmed and added to industry blocklists. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures such as state-of-the-art guardrails proved insufficient to prevent this behavior. Our findings offer conclusive evidence of large-scale data poisoning in the training pipelines of production LLMs, highlighting a fundamental security gap that requires urgent attention from the research community.
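The final auditing step described above (checking LLM-generated code snippets for URLs that appear in scam blocklists) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the domain names and the in-memory blocklist are hypothetical stand-ins for the industry scam databases the framework actually queries.

```python
import re

# Hypothetical blocklist of known scam domains (illustrative only;
# the real audit checks against industry scam databases).
SCAM_DOMAINS = {"scam-api.example", "fake-pay.example"}

# Capture the host portion of any http(s) URL in a code snippet.
URL_RE = re.compile(r"https?://([A-Za-z0-9.-]+)")

def flag_malicious_urls(code_snippet: str) -> list:
    """Return the domains in an LLM-generated snippet that match the blocklist."""
    domains = URL_RE.findall(code_snippet)
    return [d for d in domains if d in SCAM_DOMAINS]

# Example: a generated snippet that calls a blocklisted endpoint is flagged.
snippet = 'requests.post("https://scam-api.example/v1/transfer", json=payload)'
print(flag_malicious_urls(snippet))  # ['scam-api.example']
```

A real pipeline would also resolve redirects and normalize domains before lookup; exact string matching is used here only to keep the sketch self-contained.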
Primary Area: datasets and benchmarks
Submission Number: 4901