Abstract: This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and the associated safety implications. We introduce a two-dimensional framework for assessing LLM safety in encryption contexts: (1) instruction refusal, the ability to reject harmful obfuscated instructions, and (2) generation safety, the suppression of harmful content generation. Through comprehensive experiments, we demonstrate that models capable of decrypting ciphers are vulnerable to mismatched-generalization attacks. Our analysis reveals asymmetric safety alignment across models: some prioritize instruction refusal, while others focus on response suppression. We evaluate existing defenses against our two-dimensional framework and discuss their safety and utility trade-offs. Based on these findings, we propose a safety protocol that facilitates communication between pre-model and post-model safeguards to address these issues. This work contributes to the understanding of LLM safety in long-tail distribution scenarios and provides directions for developing more robust safety mechanisms.
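To make the two-dimensional evaluation concrete, a minimal sketch follows, assuming a hypothetical `query_model` callable; the ROT13 cipher and the keyword heuristics are illustrative stand-ins, not the paper's actual protocol or datasets.

```python
# Sketch of a two-dimensional safety check on an obfuscated instruction.
# Assumptions: `query_model` is a hypothetical text-in/text-out LLM wrapper;
# ROT13 stands in for a long-tail cipher; keyword matching stands in for
# proper refusal/harm classifiers.
import codecs
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")  # heuristic

def encrypt(instruction: str) -> str:
    """Obfuscate an instruction with ROT13 (a stand-in cipher)."""
    return codecs.encode(instruction, "rot_13")

def looks_safe(response: str) -> bool:
    """Placeholder content check; a real pipeline would use a classifier."""
    return "step 1" not in response  # crude proxy for procedural harm

def evaluate(query_model: Callable[[str], str], instruction: str) -> dict:
    """Score one obfuscated instruction on both safety dimensions."""
    response = query_model(encrypt(instruction)).lower()
    refused = any(marker in response for marker in REFUSAL_MARKERS)
    return {
        "instruction_refusal": refused,                       # dimension 1
        "generation_safety": refused or looks_safe(response), # dimension 2
    }
```

Under this sketch, a model can fail dimension 1 (it decrypts and complies rather than refusing) while still passing dimension 2 if its output is suppressed, which is the asymmetry the framework is designed to expose.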
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM Security, Intersection of AI and Cybersecurity, Cryptanalysis
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7906