Abstract: Large Language Models (LLMs) are increasingly deployed in high-impact scenarios, raising concerns about their safety and security. Despite existing defense mechanisms, LLMs remain vulnerable to adversarial attacks. This paper introduces the novel pipeline SENTRY (semantic entropy-based attack recognition system) for detecting such attacks by leveraging the predictive entropy of model outputs, quantified through the Token-Level Shifting Attention to Relevance (TokenSAR) score, a weighted token entropy measurement. Our approach dynamically identifies adversarial inputs without relying on prior knowledge of attack specifications. It requires only ten newly generated tokens, making it a computationally efficient and adaptable solution.
We evaluate the pipeline on multiple state-of-the-art models, including Llama, Vicuna, Falcon, Deep Seek, and Mistral, using a diverse set of adversarial prompts generated via the h4rm31 framework. Experimental results demonstrate a clear separation in TokenSAR scores between benign, malicious, and adversarial prompts. This distinction enables effective threshold-based classification, achieving robust detection performance across various model architectures. Our method outperforms traditional defenses in terms of adaptability and resource efficiency.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Modeling, NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 281
Loading