PRISM: A Unified and Generalizable Adversarial Robustness Evaluation Framework for LLM-based Classification
Keywords: Adversarial Robustness, Generative Models, Frontier Models, Phishing Detection, Linguistic Refinement, Prompt Injection, Cross-lingual Shifts
TL;DR: A Unified and Generalizable Adversarial Robustness Evaluation Framework for LLM-based Classification
Abstract: Phishing email compromise persists as one of the most pervasive and globally consequential vectors of cyber intrusion. Detection remains particularly challenging in multilingual environments, where script diversity, low-resource languages, and adversarial linguistic shifts increase false-positive and false-negative rates. Although Large Language Models (LLMs) achieve high baseline performance on phishing detection, their resilience under adversarial manipulations and multilingual distributional shifts is insufficiently characterized. We present PRISM, a unified and generalizable framework that evaluates adversarial robustness of LLM-based classification. PRISM integrates three attack dimensions in the form of semantic-preserving linguistic refinement, prompt-level instruction injection, and cross-lingual shifts. We instantiate phishing as a representative security-critical case study and evaluate frontier LLMs (GPT-4o, Claude Sonnet 4, and Grok-3) under PRISM. Within this framework, prompt-level manipulations are operationalized as instruction-space perturbations exploiting LLM compliance to induce misclassification. Empirically, models exhibit strong accuracy ($\approx 0.88$ to $0.95$); however, they also reveal asymmetric vulnerability signatures, with refinement reducing accuracy by $\approx12\%$ in Claude and $\approx4\%$ in GPT-4o, and large-scale prompt injections yielding attack success rates of $\approx4$ to $12\%$. Cross-lingual translation (Bangla, Chinese, Hindi; $\approx 95:5$ class composition) substantially increases false-positive rates (e.g., $+10\times$ in Claude relative to English), undermining reliable deployment. Under class imbalance, zero-shot prompting achieves improved performance relative to structured and chain-of-thought variants (mean F1 $\approx0.79$ vs $0.66$ for structured and up to $0.77$ for CoT, depending on model) while maintaining significantly lower latency. PRISM characterizes structural weaknesses in LLM detectors and establishes a principled, generalizable protocol for securing LLM-based classification in multilingual, security-critical contexts.
Primary Area: generative models
Submission Number: 21243
Loading