Keywords: LLM safety, prompt injection, jailbreak defense, knowledge graphs, Bayesian inference, neuro-symbolic reasoning, over-defense, interpretability
Abstract: Safety mechanisms for large language models (LLMs) must reason under semantic uncertainty, particularly when benign inputs contain surface patterns associated with prompt injection or jailbreak attacks. Existing defenses often rely on keyword heuristics, latent boundary modeling, or fixed-hop knowledge graph reasoning, leading to over-defense, brittle decisions, and limited interpretability. We introduce ProKG, a probabilistic knowledge graph reasoning framework that formulates LLM safety assessment as Bayesian inference over complete ⟨subject, predicate, object⟩ triplets. Each triplet is treated as a random variable with an explicit posterior belief, and uncertainty is propagated via joint and conditional Bayesian updates rather than deterministic rules. Reasoning scope expands adaptively based on posterior probability mass rather than fixed-hop traversal. Experiments on NotInject and AdvBench show that ProKG substantially reduces false-positive refusals on benign trigger-containing inputs while preserving strong robustness against adversarial jailbreaks, at practical inference cost. These results demonstrate that triplet-level probabilistic reasoning provides a principled, interpretable foundation for robust LLM safety under uncertainty.
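The two core ideas in the abstract — a Bayesian posterior update per triplet and scope expansion driven by posterior mass instead of a fixed hop count — can be sketched as follows. This is an illustrative sketch only, under assumed interfaces: the function names, the single-scalar belief per triplet, and the threshold-based frontier are hypothetical simplifications, not ProKG's actual implementation.

```python
# Illustrative sketch (assumed interfaces, not ProKG's released code):
# each triplet carries a scalar posterior belief P(harmful), updated via
# Bayes' rule, and graph traversal expands only where posterior mass is high.

def bayes_update(prior: float,
                 p_evidence_given_harmful: float,
                 p_evidence_given_benign: float) -> float:
    """Posterior P(harmful | evidence) via Bayes' rule."""
    num = p_evidence_given_harmful * prior
    den = num + p_evidence_given_benign * (1.0 - prior)
    return num / den

def expand_by_mass(beliefs: dict, graph: dict, threshold: float = 0.5) -> set:
    """Adaptive reasoning scope: follow edges only from triplets whose
    posterior belief exceeds the threshold, with no fixed hop limit."""
    frontier = {t for t, p in beliefs.items() if p > threshold}
    scope = set(frontier)
    while frontier:
        nxt = set()
        for t in frontier:
            for neighbor in graph.get(t, []):
                if neighbor not in scope and beliefs.get(neighbor, 0.0) > threshold:
                    nxt.add(neighbor)
        scope |= nxt
        frontier = nxt
    return scope
```

For example, a benign input whose surface pattern weakly suggests an attack (prior 0.2) and whose surrounding evidence is only mildly attack-like yields a posterior below a refusal threshold, while expansion halts automatically at low-belief neighbors rather than at a fixed hop count.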
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, robustness, red teaming, security and privacy, neurosymbolic approaches, adversarial attacks
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Submission Number: 6707