Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs

23 Aug 2025 (modified: 27 Oct 2025)Submitted to NeurIPS Lock-LLM Workshop 2025EveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLM Security, Prompt Injection, Jailbreaking, Adversarial Prompts, Data Poisoning, Trojan Attacks, Red Teaming, Threat Modeling, AI Safety, Chain-of-Thought Exploitation, Un-Editable LLMs, Un-Finetunable LLMs, Model Integrity
TL;DR: This paper provides a systematic survey of prompt-based attack vectors, establishing a comprehensive threat model to guide the development of inherently secure and un-exploitable LLMs.
Abstract: Large Language Models (LLMs) are increasingly deployed in complex, socio-technical systems where they must operate reliably despite interacting with unreliable data sources. A critical failure mode arises from strategically manipulated inputs, where adversarial users craft prompts to bypass safety mechanisms, induce harmful behavior, or extract sensitive information. These prompt-based attacks represent a fundamental challenge to building reliable machine learning systems, as they exploit vulnerabilities in a model's training, contextual understanding, and alignment. This paper provides a comprehensive literature survey of these attack methodologies, categorizing them as forms of strategic data manipulation. We structure our analysis around direct input attacks, semantic and reasoning exploits, and system-level vulnerabilities arising from integration with external data and tools. By systematically mapping this landscape, this survey offers a foundational resource for researchers and practitioners focused on developing principled and deployable solutions for reliable ML in the face of adversarial and unreliable data.
Submission Number: 7
Loading