Keywords: LLM safety, prompt injection, jailbreak defense, knowledge graphs, Bayesian inference, neuro-symbolic reasoning, over-defense, interpretability
Abstract: Safety mechanisms for large language models (LLMs) must reason under semantic uncertainty, particularly when benign inputs contain surface patterns associated with prompt injection or jailbreak attacks. Existing defenses often rely on keyword heuristics, latent boundary modeling, or fixed-hop knowledge graph reasoning, leading to over-defense, brittle decisions, and limited interpretability. We introduce ProKG, a probabilistic knowledge graph reasoning framework that formulates LLM safety assessment as Bayesian inference over complete ⟨subject, predicate, object⟩ triplets. Each triplet is treated as a random variable with an explicit posterior belief, and uncertainty is propagated via joint and conditional Bayesian updates rather than deterministic rules. Reasoning scope expands adaptively based on posterior probability mass rather than fixed-hop traversal. Experiments on NotInject and AdvBench show that ProKG substantially reduces false-positive refusals on benign trigger-containing inputs while preserving strong robustness against adversarial jailbreaks, at practical inference cost. These results demonstrate that triplet-level probabilistic reasoning provides a principled, interpretable foundation for robust LLM safety under uncertainty.
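The two core ideas in the abstract — a Bayesian posterior update per triplet and scope expansion driven by posterior mass instead of a fixed hop count — can be sketched as follows. This is an illustrative sketch only, under assumed interfaces: the function names, the single-scalar belief per triplet, and the threshold-based frontier are hypothetical simplifications, not ProKG's actual implementation.

```python
# Illustrative sketch (assumed interfaces, not ProKG's released code):
# each triplet carries a scalar posterior belief P(harmful), updated via
# Bayes' rule, and graph traversal expands only where posterior mass is high.

def bayes_update(prior: float,
                 p_evidence_given_harmful: float,
                 p_evidence_given_benign: float) -> float:
    """Posterior P(harmful | evidence) via Bayes' rule."""
    num = p_evidence_given_harmful * prior
    den = num + p_evidence_given_benign * (1.0 - prior)
    return num / den

def expand_by_mass(beliefs: dict, graph: dict, threshold: float = 0.5) -> set:
    """Adaptive reasoning scope: follow edges only from triplets whose
    posterior belief exceeds the threshold, with no fixed hop limit."""
    frontier = {t for t, p in beliefs.items() if p > threshold}
    scope = set(frontier)
    while frontier:
        nxt = set()
        for t in frontier:
            for neighbor in graph.get(t, []):
                if neighbor not in scope and beliefs.get(neighbor, 0.0) > threshold:
                    nxt.add(neighbor)
        scope |= nxt
        frontier = nxt
    return scope
```

For example, a benign input whose surface pattern weakly suggests an attack (prior 0.2) and whose surrounding evidence is only mildly attack-like yields a posterior below a refusal threshold, while expansion halts automatically at low-belief neighbors rather than at a fixed hop count.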
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, robustness, red teaming, security and privacy, neurosymbolic approaches, adversarial attacks
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Submission Number: 6707