Keywords: Information forensics, instruction watermarking, prompt injection defense, session provenance
TL;DR: PromptHash is an instruction-side, paraphrase-tolerant watermark that cryptographically binds prompts to session context via a keyed hash chain and codebook rendering, enabling low-overhead, splice-robust provenance verification.
Abstract: Large language models (LLMs) increasingly operate in retrieval-augmented and multi-agent workflows where \emph{instruction provenance} is critical, yet adversaries can exploit \emph{cross-context splicing with paraphrasing} to evade attribution. Existing content/behavior detectors degrade once surface form changes, and output-side watermarks primarily target generations rather than instructions. We propose \emph{PromptHash}, a self-authenticating, instruction-side watermark that normalizes and segments prompts, computes a position-sensitive keyed hash chain bound to session metadata, and renders tags via a compact, semantics-preserving codebook with fuzzy verification tolerant to paraphrase and tokenization jitter. PromptHash is model-agnostic, deploys as a lightweight pre/post-processor, and introduces negligible cost. On the Paraphrase Attack Corpus (PAC), Splice-and-Reflow Benchmark (SRB), and Indirect Injection Testbed (IIT), PromptHash achieves TAR $98.3{\pm}0.4\%$, FAR $0.8{\pm}0.2\%$, and RAR $96.6{\pm}0.6\%$ with sub-millisecond CPU latency and $<0.4\%$ token inflation, consistently surpassing detectors and adapted output watermarks. These results establish instruction-side watermarking as a practical primitive for accountable LLM session forensics, ensuring splice/edit integrity while preserving usability.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Supplementary Material: zip
Submission Number: 6487
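Illustrative sketch: the abstract describes normalizing and segmenting a prompt, then computing a position-sensitive keyed hash chain bound to session metadata. The Python below is one possible reading of that step, not the authors' implementation. The function names (`normalize`, `hash_chain`, `tag`, `verify`), the choice of HMAC-SHA256, the sentence-level segmentation, and the byte layout of each chain link are all assumptions made for illustration. The codebook rendering and the paraphrase-tolerant fuzzy verification are not modeled; verification here is exact-match on the normalized text.

```python
import hashlib
import hmac
import re
from typing import List


def normalize(prompt: str) -> List[str]:
    """Lowercase, collapse whitespace, and split into sentence-like segments.

    Assumed normalization; the abstract does not specify the actual rules.
    """
    text = re.sub(r"\s+", " ", prompt.strip().lower())
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def hash_chain(key: bytes, session_meta: bytes, segments: List[str]) -> List[bytes]:
    """Return [seed, link_0, link_1, ...] for the given segments.

    The seed binds the chain to the session metadata. Each link depends on
    the previous state, the segment index, and the segment text, so moving,
    dropping, or inserting a segment changes every later link.
    """
    state = hmac.new(key, b"session|" + session_meta, hashlib.sha256).digest()
    chain = [state]
    for index, segment in enumerate(segments):
        message = state + index.to_bytes(4, "big") + segment.encode("utf-8")
        state = hmac.new(key, message, hashlib.sha256).digest()
        chain.append(state)
    return chain


def tag(key: bytes, session_meta: bytes, prompt: str) -> bytes:
    """Final chain state for a prompt (the seed itself if the prompt is empty)."""
    return hash_chain(key, session_meta, normalize(prompt))[-1]


def verify(key: bytes, session_meta: bytes, prompt: str, expected: bytes) -> bool:
    """Exact verification using a constant-time comparison."""
    return hmac.compare_digest(tag(key, session_meta, prompt), expected)


if __name__ == "__main__":
    key = b"hypothetical-demo-key"        # placeholder, not a real key
    session = b"session-id=42;agent=a1"   # placeholder session metadata
    prompt = "Summarize the report. Cite sources."
    t = tag(key, session, prompt)
    print(verify(key, session, prompt, t))
    print(verify(key, session, "Cite sources. Summarize the report.", t))
```

In this sketch, a prompt spliced into a different session, or with its segments reordered, fails verification because the seed or an intermediate link changes. Whitespace and case differences are absorbed by `normalize`, which is far weaker than the paraphrase tolerance the abstract claims for the full scheme.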