Exploring LLM Vulnerabilities via Abductive ASCII Prompt Attacks

ACL ARR 2025 February Submission 1172 Authors

13 Feb 2025 (modified: 09 May 2025), CC BY 4.0
Abstract: Large Language Models (LLMs) excel in diverse tasks but also pose risks of misuse for harmful purposes. Aiming to strengthen defenses against such vulnerabilities, LLM safety research has explored jailbreaking attacks that bypass safeguards to elicit harmful outputs. We propose Abductive ASCII Prompt Attack (APT), a novel and universally applicable jailbreaking method that requires only black-box access. APT leverages abductive framing, instructing LLMs to infer plausible steps for harmful activities rather than responding to direct queries. Additionally, APT employs ASCII encoding, a lightweight and adaptable scheme, to obscure harmful content. Experiments show that APT achieves an attack success rate of over 95% on GPT-series models and over 70% across all targets. Our analysis further reveals weaknesses in LLM safety alignment, as overly restrictive models may misclassify benign prompts as harmful.
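The encoding component can be pictured with a minimal sketch, assuming the scheme simply maps each character of a query to its decimal ASCII code point; the function names ascii_encode/ascii_decode and the placeholder string below are illustrative and not the authors' implementation.

# Minimal sketch of an ASCII-based obfuscation step (assumed decimal code-point mapping).
def ascii_encode(text: str) -> str:
    """Encode a string as space-separated decimal ASCII code points."""
    return " ".join(str(ord(ch)) for ch in text)

def ascii_decode(codes: str) -> str:
    """Recover the original string from space-separated code points."""
    return "".join(chr(int(c)) for c in codes.split())

if __name__ == "__main__":
    sample = "example query"        # benign placeholder text
    encoded = ascii_encode(sample)  # e.g. "101 120 97 109 112 108 101 32 ..."
    assert ascii_decode(encoded) == sample
    print(encoded)

Such an encoding is lightweight in the sense described in the abstract: it needs no external tooling and can be applied to any substring of a prompt.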
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: safety, jailbreaking, security
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1172