Exploring LLM Vulnerabilities via Abductive ASCII Prompt Attacks

ACL ARR 2025 February Submission 1172 Authors

13 Feb 2025 (modified: 09 May 2025), CC BY 4.0
Abstract: Large Language Models (LLMs) excel in diverse tasks but also pose risks of misuse for harmful purposes. Aiming to strengthen defenses against such vulnerabilities, LLM safety research has explored jailbreaking attacks that bypass safeguards to elicit harmful outputs. We propose Abductive ASCII Prompt Attack (APT), a novel and universally applicable jailbreaking method that requires only black-box access. APT leverages abductive framing, instructing LLMs to infer plausible steps for harmful activities rather than responding to direct queries. Additionally, APT employs ASCII encoding, a lightweight and adaptable scheme, to obscure harmful content. Experiments show that APT achieves an attack success rate of over 95% on GPT-series models and over 70% across all targets. Our analysis further reveals weaknesses in LLM safety alignment, as overly restrictive models may misclassify benign prompts as harmful.
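The encoding component can be pictured with a minimal sketch, assuming the scheme simply maps each character of a query to its decimal ASCII code point; the function names ascii_encode/ascii_decode and the placeholder string below are illustrative and not the authors' implementation.

# Minimal sketch of an ASCII-based obfuscation step (assumed decimal code-point mapping).
def ascii_encode(text: str) -> str:
    """Encode a string as space-separated decimal ASCII code points."""
    return " ".join(str(ord(ch)) for ch in text)

def ascii_decode(codes: str) -> str:
    """Recover the original string from space-separated code points."""
    return "".join(chr(int(c)) for c in codes.split())

if __name__ == "__main__":
    sample = "example query"        # benign placeholder text
    encoded = ascii_encode(sample)  # e.g. "101 120 97 109 112 108 101 32 ..."
    assert ascii_decode(encoded) == sample
    print(encoded)

Such an encoding is lightweight in the sense described in the abstract: it needs no external tooling and can be applied to any substring of a prompt.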
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: safety, jailbreaking, security
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1172