Prompting Instability: An Empirical Study of LLM Robustness in Code Vulnerability Detection

Published: 2025 · Last Modified: 22 Jan 2026 · AI (1) 2025 · License: CC BY-SA 4.0
Abstract: Large Language Models (LLMs) are increasingly adopted for software engineering and cybersecurity tasks such as vulnerability detection. In practice, humans phrase prompts in different ways even when conveying the same intent. However, LLMs have been observed to produce inconsistent outputs in response to semantically equivalent paraphrased prompts, even for simple binary (yes/no) questions. This variability poses significant challenges to their reliability, eroding developer trust and compromising reproducibility in both research and real-world use. Addressing this issue is essential to ensure that LLMs can consistently support security-critical workflows.
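To make the instability concrete, the sketch below shows one simple way such paraphrase (in)consistency could be quantified for a binary vulnerability question: query a model with several rewordings of the same prompt and report how often the answers agree with the majority label. The function names (`paraphrase_consistency`, `ask_llm`), the example paraphrases, and the stubbed model call are illustrative assumptions, not the study's actual protocol or metric.

```python
from collections import Counter
from typing import Callable, List


def paraphrase_consistency(ask_llm: Callable[[str], str], paraphrases: List[str]) -> float:
    """Fraction of paraphrased prompts whose yes/no answer matches the
    majority answer across all paraphrases (1.0 = fully consistent)."""
    answers = []
    for prompt in paraphrases:
        reply = ask_llm(prompt).strip().lower()
        # Reduce a free-form reply to a binary label (crude heuristic).
        answers.append("yes" if reply.startswith("yes") else "no")
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)


if __name__ == "__main__":
    snippet = "strcpy(buf, user_input);"
    paraphrases = [
        f"Does this C code contain a vulnerability? Answer yes or no.\n{snippet}",
        f"Is the following code snippet vulnerable (yes/no)?\n{snippet}",
        f"Answer yes or no: is there a security flaw in this code?\n{snippet}",
    ]
    # Stand-in for a real model call; swap in an actual LLM client here.
    dummy_llm = lambda prompt: "yes"
    print(f"Consistency: {paraphrase_consistency(dummy_llm, paraphrases):.2f}")
```

With a real model in place of `dummy_llm`, a score below 1.0 on semantically equivalent prompts is exactly the kind of inconsistency the abstract describes.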