Keywords: LLM safety, Jailbreak Detection, Black-box detectors, Semantic entropy, Behavioral consistency, Adversarial prompting, Safety alignment
Abstract: Black-box jailbreak detection for Large Language Models (LLMs) remains challenging, particularly when internal states are inaccessible. Semantic entropy (SE)---successfully used for hallucination detection---offers a promising behavioral approach based on response consistency analysis. We hypothesize that jailbreak prompts create internal conflict between safety training and instruction-following, potentially manifesting as inconsistent responses with high semantic entropy. We systematically evaluate this approach using an embedding-based implementation of SE, adapted from Farquhar et al.'s bidirectional entailment method to operate under black-box constraints. Testing across two model families (Llama and Qwen) and two benchmarks (JailbreakBench, HarmBench), we find that SE fails with 85-98% false negative rates; it is consistently outperformed by simpler baselines and exhibits extreme hyperparameter sensitivity. We identify the primary failure mechanism as the "Consistency Confound": well-aligned models produce consistent, templated refusals that SE misinterprets as safe behavior, accounting for 73-97% of false negatives with high statistical confidence (95% Wilson confidence intervals). While SE's core assumption that response inconsistency indicates problematic content holds in limited cases, threshold brittleness renders it practically unreliable. Our results suggest that, for this SE variant, response consistency may not be a reliable signal for jailbreak detection, as stronger alignment leads to more predictable outputs that confound this type of diversity-based detector.
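To make the detection setup concrete, the sketch below illustrates one plausible form of an embedding-based semantic-entropy score under black-box constraints: sampled responses are greedily clustered by cosine similarity of their embeddings, and entropy is computed over the cluster-size distribution. This is an illustrative sketch, not the authors' implementation; the function name, the clustering rule, and the similarity threshold (0.85) are assumptions, and the threshold is exactly the kind of hyperparameter the abstract reports as brittle.

```python
import numpy as np

def semantic_entropy(embeddings: np.ndarray, sim_threshold: float = 0.85) -> float:
    """Entropy over semantic clusters of N sampled responses.

    embeddings: (N, d) array of response embeddings from any black-box encoder.
    sim_threshold: assumed cosine-similarity cutoff for merging responses.
    """
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    clusters: list[list[int]] = []
    for i, vec in enumerate(normed):
        for cluster in clusters:
            # Join the first cluster whose representative is similar enough.
            if float(normed[cluster[0]] @ vec) >= sim_threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # Entropy (in nats) of the empirical distribution over clusters.
    probs = np.array([len(c) for c in clusters], dtype=float) / len(normed)
    return float(-(probs * np.log(probs)).sum())

if __name__ == "__main__":
    # Five near-identical templated refusals collapse into one cluster,
    # giving entropy ~0 (read as "safe") -- the "Consistency Confound"
    # failure mode described in the abstract.
    rng = np.random.default_rng(0)
    base = rng.normal(size=16)
    refusals = np.stack([base + 0.01 * rng.normal(size=16) for _ in range(5)])
    print(semantic_entropy(refusals))  # ~0.0
```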
Supplementary Material: zip
Submission Number: 151