\documentclass{article}
\usepackage{iclr2026_conference,times}


\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{url}
\input{math_commands.tex}
% ---- 导言区需要的包（ICLR 模版） ----
\usepackage{amsmath,amssymb}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode} % 建议用 algpseudocode，语法更现代


% 可选：把关键词文字改粗
\algrenewcommand\algorithmicrequire{\textbf{Input:}}
\algrenewcommand\algorithmicensure{\textbf{Output:}}

% 小工具宏：串接符号、指示函数
\newcommand{\concat}{\mathbin{\|}}                % 用 “||” 表示串接
\newcommand{\Indicator}[1]{\mathbf{1}\!\left[#1\right]} % 指示函数 1[·]



\title{PromptHash: Robust Instruction Watermarks Against Paraphrase and Splicing in LLM Forensics}

% Authors must not appear in the submitted version. They should be hidden
% as long as the \iclrfinalcopy macro remains commented out below.
% Non-anonymous submissions will be rejected without review.

% \author{Antiquus S.~Hippocampus, Natalia Cerebro \& Amelie P. Amygdale \thanks{ Use footnote for providing further information
% about author (webpage, alternative address)---\emph{not} for acknowledging
% funding agencies.  Funding acknowledgements go at the end of the paper.} \\
% Department of Computer Science\\
% Cranberry-Lemon University\\
% Pittsburgh, PA 15213, USA \\
% \texttt{\{hippo,brain,jen\}@cs.cranberry-lemon.edu} \\
% \And
% Ji Q. Ren \& Yevgeny LeNet \\
% Department of Computational Neuroscience \\
% University of the Witwatersrand \\
% Joburg, South Africa \\
% \texttt{\{robot,net\}@wits.ac.za} \\
% \AND
% Coauthor \\
% Affiliation \\
% Address \\
% \texttt{email}
% }

% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to \LaTeX{} to determine where to break
% the lines. Using \AND forces a linebreak at that point. So, if \LaTeX{}
% puts 3 of 4 authors names on the first line, and the last on the second
% line, try using \AND instead of \And before the third author name.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

%\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.
\begin{document}


\maketitle

\begin{abstract}
Large language models (LLMs) increasingly operate in retrieval-augmented and multi-agent workflows where \emph{instruction provenance} is critical, yet adversaries can exploit \emph{cross-context splicing with paraphrasing} to evade attribution. Existing content/behavior detectors degrade once surface form changes, and output-side watermarks primarily target generations rather than instructions. We propose \emph{PromptHash}, a self-authenticating, instruction-side watermark that normalizes and segments prompts, computes a position-sensitive keyed hash chain bound to session metadata, and renders tags via a compact, semantics-preserving codebook with fuzzy verification tolerant to paraphrase and tokenization jitter. PromptHash is model-agnostic, deploys as a lightweight pre/post-processor, and introduces negligible cost. On the Paraphrase Attack Corpus (PAC), Splice-and-Reflow Benchmark (SRB), and Indirect Injection Testbed (IIT), PromptHash achieves TAR $98.3{\pm}0.4\%$, FAR $0.8{\pm}0.2\%$, and \RAR $96.6{\pm}0.6\%$ with sub-millisecond CPU latency and $<0.4\%$ token inflation, consistently surpassing detectors and adapted output watermarks. These results establish instruction-side watermarking as a practical primitive for accountable LLM session forensics, ensuring splice/edit integrity while preserving usability.
\end{abstract}


%

\section{Introduction}
\label{sec:intro}

Large language models (LLMs) are increasingly deployed in retrieval-augmented systems, tool-use agents, and multi-agent workflows, where \emph{forensic provenance}—verifying \emph{who} issued \emph{which} instruction and \emph{whether it was altered}—is essential for accountability. A key unresolved challenge is \emph{cross-context splicing with paraphrasing}: adversaries can relocate instructions across sessions, reflow formatting, or subtly rephrase text, breaking provenance without changing semantics. Content- or behavior-based detectors fail once surface form is altered \cite{mitchell2023detectgpt,weberwulff2023detectors}, while output-side watermarks target model generations rather than user instructions \cite{kirchenbauer2023watermark,kuditipudi2023robust,dathathri2024synthid,li2024trgof}, and their robustness under paraphrase and mixing remains debated \cite{kirchenbauer2023reliability,rastogi2024revisiting,ren2023semamark}.

The need for instruction provenance is amplified by prompt-injection and jailbreak attacks, which smuggle adversarial commands into model inputs \cite{greshake2023not,yi2024jailbreaksurvey}. Recent benchmarks show such attacks remain effective across models and interfaces \cite{chao2024jailbreakbench,yi2023bipia}, with black-box and suffix-based methods further systematizing jailbreaks \cite{zou2023universal}, and multi-agent pipelines exposing new propagation risks \cite{suo2024signedprompt}. Existing defenses based on signed or structured prompts improve interface-level robustness but require protocol changes and remain vulnerable under adaptive paraphrase and splicing \cite{chen2024struq,hines2024spotlighting,owasp2025llm01}.

We propose \emph{PromptHash}, a self-authenticating, instruction-side watermark. PromptHash computes a keyed, collision-resistant hash over normalized instruction segments, binds it to session metadata, and renders it as unobtrusive, semantics-preserving constraints via a compact codebook. A chaining design enforces splice/edit integrity, while a fuzzy verifier tolerates paraphrase and tokenization jitter. Unlike output-side watermarking \cite{kirchenbauer2023watermark,kuditipudi2023robust,dathathri2024synthid}, PromptHash directly targets instruction attribution, operates purely as a pre-/post-processor, and incurs sub-percent token and latency overhead.

Our contributions are threefold: (i) we formalize a paraphrase-tolerant, splice-aware verification objective for instruction provenance \cite{chao2024jailbreakbench,yi2023bipia,zou2023universal}; (ii) we design PromptHash by combining codebook-constrained rendering, hash chaining, and fuzzy verification to preserve robustness under realistic paraphrase/edit operations; and (iii) we provide a model-agnostic implementation with negligible overhead, validated against splicing, paraphrase, and indirect-injection attacks, showing superior verification accuracy relative to content- and behavior-based baselines while remaining compatible with structured-query and spotlighting defenses \cite{chen2024struq,hines2024spotlighting}. Together, these results position instruction-side watermarks as a practical primitive for session-level provenance, complementing output watermarks \cite{kirchenbauer2023watermark,kuditipudi2023robust,dathathri2024synthid} in the face of paraphrase-resilient, cross-context threats.


\section{Related Work}
\label{sec:related}

\subsection{Detecting machine-generated text}
Early detection approaches exploit token-level statistics and curvature-based signals to distinguish LM outputs from human text. GLTR visualizes distributional anomalies by probing how probable each token is under a reference language model and highlighting deviations from human-like sampling patterns \citep{gehrmann2019gltr}. While effective as a forensic aid, such visualization-centric tooling typically assumes access to stable token probability distributions and may be sensitive to domain shift and light editing. DetectGPT proposes a zero-shot curvature test on log-probability surfaces, positing that model-generated passages occupy regions with characteristic curvature distinct from human-written text \citep{mitchell2023detectgpt}. This hypothesis enables detector construction without supervised training, yet its reliance on the local geometry of a particular model’s likelihood landscape leaves open questions about cross-model generalization, robustness to paraphrase, and resilience against adversarial edits that flatten or perturb curvature. Benchmarks like TuringBench \citep{uchendu2021turingbench} and deployment analyses \citep{solaiman2019release} further reveal that distributional mismatch between training and evaluation corpora, style transfer, and post-editing (e.g., paraphrasing, summarization, or format conversion) can substantially degrade detection accuracy. Overall, detector families that rely on surface-form probabilities or shallow statistics offer valuable first-pass screening but struggle to provide provenance guarantees once the text has undergone paraphrase, reformatting, or cross-context relocation—precisely the manipulation regime our work targets.

\subsection{Watermarking language model outputs}
A complementary line of work embeds verifiable signatures directly into \emph{generated} text. \citet{kirchenbauer2023watermark} introduce a token-bucket greenlist sampling strategy that biases generation toward subsets of the vocabulary conditioned on a secret seed, enabling statistical tests for the presence of a watermark ex post. Subsequent efforts pursue improved robustness and reduced distortion; \citet{kuditipudi2023robust} study distortion-free watermarking schemes that aim to preserve utility while retaining reliable detection, and \citet{dathathri2024synthid} present scalable watermarking validated at industrial scale with practical considerations for deployment. Despite their promise, these methods are intrinsically \emph{output-centric}: they assume control over the decoding process and test for the presence of patterns in model \emph{generations}. As such, they face inherent challenges under heavy paraphrase, aggressive editing, or content mixing (e.g., human-in-the-loop revisions or retrieval-augmented concatenation) that may dilute or erase the statistical signal. More importantly for our setting, output watermarks do not address attribution for \emph{user instructions} that precede generation. When the provenance question is “who issued which instruction to the system, and was it spliced or altered,” output-side signals are at best indirect. Our approach, in contrast, relocates the watermark to the instruction layer and binds it cryptographically to session metadata, so that verification remains feasible even when outputs are unavailable or irrelevant to the attribution query.

\subsection{Prompt injection and jailbreaks}
Real-world attacks on LLM-integrated systems demonstrate that untrusted inputs can steer models via embedded instructions, effectively bypassing high-level content filters and UI-layer controls \citep{greshake2023not}. Such prompt-injection vectors frequently exploit cross-context propagation in retrieval-augmented pipelines, tool-use agents, and multi-hop workflows, where intermediate artifacts (HTML, Markdown, PDFs) may carry adversarial instructions into subsequent model calls. In parallel, universal jailbreak strings have been shown to transfer across models and tasks, suggesting that attack surfaces are not idiosyncratic to a single architecture but arise from more general alignment and decoding dynamics \citep{zou2023universal}. From a forensic perspective, these observations underscore the insufficiency of output-only checks: if an adversary can paraphrase, reflow, or splice an instruction into a different session or context, downstream detection must reason about \emph{instruction provenance} rather than only the generated text. This motivates mechanisms that (i) bind instructions to session-level metadata, (ii) preserve verifiability under benign paraphrase and formatting changes, and (iii) expose tamper evidence when segments are transplanted across contexts.

\subsection{Content provenance beyond text generation}
Beyond the LM literature, content authenticity frameworks such as C2PA provide cryptographic provenance for digital media by attaching signed assertions that document capture, edit history, and device or software identity \citep{c2pa2024spec}. These standards highlight the value of end-to-end provenance and audit trails, but they operate at the level of media assets and their transformations, not at the granularity of \emph{interactive instructions} exchanged with LLMs. In multi-agent or tool-augmented settings, instructions are often ephemeral, paraphrased, or programmatically reflowed; they traverse logs and intermediate buffers rather than being exported as durable assets with attached manifests. Our work complements content credentials by introducing an instruction-layer primitive that is lightweight enough for pre-/post-processing, cryptographically binds to session context, and remains verifiable post hoc from logs even after surface-form changes.

\subsection{Position of this work}
PromptHash differs from output-side watermarking by (i) targeting \emph{instruction attribution} rather than generated content, (ii) employing a position-sensitive keyed hash chain bound to session metadata to enforce splice/edit integrity, and (iii) rendering tags via compact, semantics-preserving codebooks with fuzzy verification to tolerate paraphrase and tokenization jitter. Relative to detector-based baselines, our design eschews reliance on raw likelihoods or curvature properties and instead provides a cryptographic binding that remains meaningful under cross-context splicing and reformatting. In short, we treat instruction provenance as a first-class forensic objective: instructions are normalized and segmented, cryptographically chained to context, and rendered through minimal surface edits that survive benign rewriting, enabling reliable post-hoc verification precisely in the regimes where traditional detectors and output watermarks are most fragile.




% \section{Proposed Method}
% \label{sec:method}

% We propose \emph{PromptHash}, a self-authenticating, instruction-side watermark that binds instructions to session context while remaining tolerant to paraphrasing and tokenization jitter. Let $\Sigma$ denote the text alphabet and $\Sigma^\ast$ the set of finite strings. An instruction is $x\in\Sigma^\ast$. A tokenizer $T(\cdot)$ maps text to tokens $X=(x_1,\dots,x_n)$ with length $n\in\mathbb{N}$. Session metadata is $M=(\textsf{role},\textsf{nonce},\textsf{ts})$, where $\textsf{role}\in\{\textsf{system},\textsf{user},\textsf{tool}\}$ indicates the emitter, $\textsf{nonce}\in\{0,1\}^{\lambda}$ is a per-session random string of length $\lambda$, and $\textsf{ts}\in\mathbb{N}$ is a coarse timestamp. The adversary may perform \emph{cross-context splicing with paraphrasing}, modeled as a stochastic paraphrase/edit channel $\mathcal{P}$ that preserves semantics but introduces lexical substitutions, formatting changes, and small edits. The concatenation operator is written $\parallel$. We use a keyed hash $H_k(\cdot)$ (e.g., KMAC/BLAKE3 keyed mode) under secret key $k$, and a domain-separation constant $\textsf{dom}\in\Sigma^\ast$. Bit-truncation is $\operatorname{Trunc}_b(\cdot)$, which keeps the least significant $b\in\mathbb{N}$ bits; $\operatorname{bin}(\cdot)$ converts a $b$-bit string to an integer; $\mathbb{I}[\cdot]$ is the indicator. The overall pipeline of PromptHash is illustrated in Fig.~\ref{fig:framework}, which shows the four stages of normalization, hash chaining, codebook rendering, and fuzzy verification.

% \begin{figure}[t]
% \centering
% \includegraphics[width=0.9\linewidth]{framework.pdf}
% \caption{Overall framework of the proposed PromptHash method, 
% including normalization, keyed hash chaining, codebook rendering, and fuzzy verification.}
% \label{fig:framework}
% \end{figure}



% PromptHash operates as follows. First, a deterministic normalization $g:\Sigma^\ast\!\to\!\Sigma^\ast$ reduces superficial variance (case folding, canonical whitespace, bullet/list standardization, punctuation canonicalization), yielding $\tilde{x}=g(x)$ and $\tilde{X}=T(\tilde{x})$. The normalized token sequence is partitioned into $m\in\mathbb{N}$ segments $\mathcal{S}=\{S_i\}_{i=1}^m$ using either fixed token length $L\in\mathbb{N}$ or syntax-aware boundaries; with $1=b_1<e_1<b_2<\cdots<e_m\le n$ we write
% \begin{equation}
% \label{eq:seg}
% S_i=\tilde{X}[b_i:e_i], \qquad i=1,\dots,m,
% \end{equation}
% where $b_i,e_i\in\mathbb{N}$ are segment start/end indices. Segmentation is chosen so that each $S_i$ admits at least one semantics-preserving rewrite.

% Second, we cryptographically bind each segment to the session and to its predecessor via a position-sensitive hash chain. With initial value
% \begin{equation}
% \label{eq:h0}
% h_0 = H_k(\textsf{dom}\parallel M),
% \end{equation}
% the per-segment chaining values $h_i\in\{0,1\}^{\ast}$ for $i=1,\dots,m$ are
% \begin{equation}
% \label{eq:hi}
% h_i = H_k(S_i \parallel i \parallel M \parallel h_{i-1}),
% \end{equation}
% where $i\in\mathbb{N}$ is the segment index included as an explicit counter. We derive a $b$-bit tag
% \begin{equation}
% \label{eq:tag}
% t_i=\operatorname{Trunc}_b(h_i)\in\{0,1\}^b,
% \end{equation}
% which will steer a semantics-preserving surface-form choice. The chain $(h_0,\dots,h_m)$ is \emph{position-dependent}; splicing a segment from another session (or reordering) changes $h_{i-1}$, making $t_i$ unpredictable without $k$ (success probability $\le 2^{-b}$ per segment).

% Third, we \emph{render} $t_i$ into a minimal, semantics-preserving edit on $S_i$ using a compact codebook of constraints $\mathcal{C}_i=\{c_{i,1},\dots,c_{i,K_i}\}$, where $K_i\in\mathbb{N}$ is small (typically $4$–$16$). Each constraint $c_{i,j}$ specifies an equivalence-preserving choice such as a lexical variant (e.g., “thus/therefore/hence”), bullet/numbering style (``-'' vs.\ ``*'', ``(i)'' vs.\ ``1.''), optional punctuation/spacing (Oxford comma, non-breaking space), or neutral typography (en-dash vs.\ em-dash). Given $t_i$, we select
% \begin{equation}
% \label{eq:choice}
% j_i = 1 + \big(\operatorname{bin}(t_i) \bmod K_i\big), \qquad c_i^\star:=c_{i,j_i},
% \end{equation}
% and produce a minimally edited rewrite $S_i'=\operatorname{Render}(S_i,c_i^\star)$ that satisfies constraint $c_i^\star$. The watermarked instruction is the merge
% \begin{equation}
% \label{eq:merge}
% x'=\operatorname{Merge}\big(\{S_i'\}_{i=1}^m\big).
% \end{equation}
% Let $|T(\cdot)|$ denote token length; the relative token overhead is
% \begin{equation}
% \label{eq:overhead}
% \Delta_{\text{tok}}=\frac{|T(x')|-|T(x)|}{|T(x)|}\le \epsilon,
% \end{equation}
% where $\epsilon\in(0,1)$ is typically $0.1\%\!-\!0.5\%$ due to the compactness of $\mathcal{C}_i$. The gross self-authentication capacity (not used to carry user payload) is
% \begin{equation}
% \label{eq:capacity}
% \mathsf{Cap}=\sum_{i=1}^m \log_2 K_i \quad \text{bits}.
% \end{equation}

% Verification is post hoc from logs. Let $y\in\Sigma^\ast$ be the observed instruction after possible transformation by $\mathcal{P}$. The verifier recomputes $\tilde{y}=g(y)$ and segments/alignment candidates $\{\hat{S}_i\}$ against $\{S_i\}$ using a monotone sequence alignment with window $w\in\mathbb{N}$ (a Sakoe–Chiba band). Recomputing \eqref{eq:h0}–\eqref{eq:tag} on the aligned originals yields the expected constraint index $j_i$ via \eqref{eq:choice}. Let $\operatorname{Renderable}(\hat{S}_i,c_i^\star)\in\{0,1\}$ indicate whether $\hat{S}_i$ satisfies $c_i^\star$ (e.g., has the selected bullet style or lexical variant). With per-segment matches
% \begin{equation}
% \label{eq:match}
% Z_i=\mathbb{I}\!\big[\operatorname{Renderable}(\hat{S}_i,c_i^\star)\big]\in\{0,1\},\qquad S=\sum_{i=1}^m Z_i,
% \end{equation}
% we accept provenance if
% \begin{equation}
% \label{eq:test}
% \texttt{Accept}\iff \left(\frac{S}{m}\ge \tau\right)\;\wedge\;\texttt{ChainOK}(\{h_i\},M;r,w),
% \end{equation}
% where $\tau\in(0,1)$ is the match threshold, and $\texttt{ChainOK}(\cdot)$ enforces hash-chain consistency while allowing at most $r\in\mathbb{N}$ local alignment slips within window $w$. Under forgery without $k$, the per-segment success probability is $p_0=\mathbb{E}[1/K_i]$. Assuming approximate independence after normalization and alignment, a Chernoff bound gives the false-accept probability
% \begin{equation}
% \label{eq:fa}
% \Pr\!\left[\frac{S}{m}\ge \tau\;\middle|\;H_0\right]\le \exp\!\Big(-m\,D_{\mathrm{KL}}(\tau\parallel p_0)\Big),
% \end{equation}
% where $H_0$ is the null (forgery) and $D_{\mathrm{KL}}(a\parallel b)=a\log\!\frac{a}{b}+(1-a)\log\!\frac{1-a}{1-b}$ is the Bernoulli Kullback–Leibler divergence. For a splice that breaks the chain at index $j\in\{1,\dots,m\}$, the probability that both tag selection and chain checks pass is upper bounded by $2^{-b}\cdot \exp\!\big(- (m-j+1)\,D_{\mathrm{KL}}(\tau\parallel p_0)\big)$.

% Finally, we discuss efficiency. Normalization/segmentation are $O(n)$ in tokens $n$, chaining requires $O(m)$ evaluations of $H_k$, rendering applies $O(m)$ constant-time constraints, and alignment is $O(nw)$ with a narrow window $w\ll n$. The design is model-agnostic and runs as a pre-/post-processor; latency and $\Delta_{\text{tok}}$ follow \eqref{eq:overhead}. Typical choices are sentence-aware segmentation (or fixed $L{=}32$), tag length $b{=}10$ (chain forgery $\le 2^{-10}$), threshold $\tau\in[0.6,0.8]$, alignment window $w{=}8$, and slip budget $r\in\{1,2\}$ for $m\approx 8$–$12$ segments. As summarized in Algorithm~\ref{alg:prompthash}, the embedding 
% and verification steps can be implemented as lightweight pre-/post-processing 
% around the LLM interface.




% % PromptHash differs from output-side watermarks by relocating the provenance primitive to the instruction layer, using a position-sensitive hash chain (\eqref{eq:h0}–\eqref{eq:tag}) bound to session metadata $M$, and realizing tags via tightly controlled, semantics-preserving codebooks (\eqref{eq:choice}) so that verification (\eqref{eq:test}) survives paraphrase/reflow while providing explicit false-accept bounds (\eqref{eq:fa}). Because all operations are pre-/post-processing and model-agnostic, the mechanism is deployable across heterogeneous LLM stacks with negligible cost.

\section{Proposed Method}
\label{sec:method}

We propose \emph{PromptHash}, a self-authenticating, instruction-side watermark that binds instructions to session context while remaining tolerant to paraphrasing and tokenization jitter. Let $\Sigma$ denote the text alphabet and $\Sigma^\ast$ the set of finite strings. An instruction is $x\in\Sigma^\ast$. A tokenizer $T(\cdot)$ maps text to tokens $X=(x_1,\dots,x_n)$ with length $n\in\mathbb{N}$. Session metadata is $M=(\textsf{role},\textsf{nonce},\textsf{ts})$, where $\textsf{role}\in\{\textsf{system},\textsf{user},\textsf{tool}\}$ indicates the emitter, $\textsf{nonce}\in\{0,1\}^{\lambda}$ is a per-session random string of length $\lambda$, and $\textsf{ts}\in\mathbb{N}$ is a coarse timestamp. The adversary may perform \emph{cross-context splicing with paraphrasing}, modeled as a stochastic paraphrase/edit channel $\mathcal{P}$ that preserves semantics but introduces lexical substitutions, formatting changes, and small edits. The concatenation operator is written $\parallel$. We use a keyed hash $H_k(\cdot)$ (e.g., KMAC/BLAKE3 keyed mode) under secret key $k$, and a domain-separation constant $\textsf{dom}\in\Sigma^\ast$. Bit-truncation is $\operatorname{Trunc}_b(\cdot)$, which keeps the least significant $b\in\mathbb{N}$ bits; $\operatorname{bin}(\cdot)$ converts a $b$-bit string to an integer; $\mathbb{I}[\cdot]$ is the indicator. The overall pipeline of PromptHash is illustrated in Fig.~\ref{fig:framework}, which shows the four stages of normalization, hash chaining, codebook rendering, and fuzzy verification.

\begin{figure}[t]
\centering
\includegraphics[width=1\linewidth]{framework.pdf}
\caption{Overall framework of the proposed PromptHash method, 
including normalization, keyed hash chaining, codebook rendering, and fuzzy verification.}
\label{fig:framework}
\end{figure}

\subsection{Normalization and Segmentation}
A deterministic normalization $g:\Sigma^\ast\!\to\!\Sigma^\ast$ reduces superficial variance such as case folding, canonical whitespace, bullet/list standardization, and punctuation canonicalization, yielding $\tilde{x}=g(x)$ and $\tilde{X}=T(\tilde{x})$. The normalized token sequence is partitioned into $m\in\mathbb{N}$ segments $\mathcal{S}=\{S_i\}_{i=1}^m$ using either fixed token length $L\in\mathbb{N}$ or syntax-aware boundaries; with $1=b_1<e_1<b_2<\cdots<e_m\le n$ we write
\begin{equation}
\label{eq:seg}
S_i=\tilde{X}[b_i:e_i], \qquad i=1,\dots,m,
\end{equation}
where $b_i,e_i\in\mathbb{N}$ are segment start/end indices. Segmentation ensures that each $S_i$ admits at least one semantics-preserving rewrite.

\subsection{Hash Chaining and Tag Extraction}
Each segment is cryptographically bound to the session and its predecessor via a position-sensitive hash chain. With initial value
\begin{equation}
\label{eq:h0}
h_0 = H_k(\textsf{dom}\parallel M),
\end{equation}
the per-segment chaining values $h_i\in\{0,1\}^{\ast}$ for $i=1,\dots,m$ are
\begin{equation}
\label{eq:hi}
h_i = H_k(S_i \parallel i \parallel M \parallel h_{i-1}),
\end{equation}
where $i\in\mathbb{N}$ is the explicit segment index. We then derive a $b$-bit tag
\begin{equation}
\label{eq:tag}
t_i=\operatorname{Trunc}_b(h_i)\in\{0,1\}^b,
\end{equation}
which drives the subsequent surface-form rendering. The chain $(h_0,\dots,h_m)$ is \emph{position-dependent}; splicing a segment from another session changes $h_{i-1}$, making $t_i$ unpredictable without $k$ (success probability $\le 2^{-b}$).

\subsection{Codebook Rendering}
The tag $t_i$ is rendered as a minimal, semantics-preserving edit on $S_i$ using a compact codebook $\mathcal{C}_i=\{c_{i,1},\dots,c_{i,K_i}\}$, with $K_i\in\mathbb{N}$ typically $4$–$16$. Each constraint $c_{i,j}$ encodes an equivalence-preserving choice such as lexical variants (``thus/therefore/hence''), bullet/numbering style, optional punctuation, or neutral typography. Given $t_i$, we select
\begin{equation}
\label{eq:choice}
j_i = 1 + \big(\operatorname{bin}(t_i) \bmod K_i\big), \qquad c_i^\star:=c_{i,j_i},
\end{equation}
and produce a minimally edited rewrite $S_i'=\operatorname{Render}(S_i,c_i^\star)$. The watermarked instruction is the merge
\begin{equation}
\label{eq:merge}
x'=\operatorname{Merge}\big(\{S_i'\}_{i=1}^m\big).
\end{equation}
Let $|T(\cdot)|$ denote token length; the relative token overhead is
\begin{equation}
\label{eq:overhead}
\Delta_{\text{tok}}=\frac{|T(x')|-|T(x)|}{|T(x)|}\le \epsilon,
\end{equation}
where $\epsilon$ is typically $0.1\%\!-\!0.5\%$. The total authentication capacity is
\begin{equation}
\label{eq:capacity}
\mathsf{Cap}=\sum_{i=1}^m \log_2 K_i \quad \text{bits}.
\end{equation}

\subsection{Fuzzy Verification}
Given observed $y\in\Sigma^\ast$ (possibly transformed by $\mathcal{P}$), the verifier recomputes $\tilde{y}=g(y)$ and aligns $\{\hat{S}_i\}$ against $\{S_i\}$ within window $w\in\mathbb{N}$. Recomputing \eqref{eq:h0}–\eqref{eq:tag} yields expected indices $j_i$. With
\begin{equation}
\label{eq:match}
Z_i=\mathbb{I}\!\big[\operatorname{Renderable}(\hat{S}_i,c_i^\star)\big],\qquad S=\sum_{i=1}^m Z_i,
\end{equation}
provenance is accepted if
\begin{equation}
\label{eq:test}
\texttt{Accept}\iff \left(\frac{S}{m}\ge \tau\right)\;\wedge\;\texttt{ChainOK}(\{h_i\},M;r,w),
\end{equation}
where $\tau$ is the acceptance threshold and $\texttt{ChainOK}$ enforces chain consistency with at most $r$ alignment slips. Forgeries succeed with probability bounded by
\begin{equation}
\label{eq:fa}
\Pr\!\left[\frac{S}{m}\ge \tau\;\middle|\;H_0\right]\le \exp\!\Big(-m\,D_{\mathrm{KL}}(\tau\parallel p_0)\Big),
\end{equation}
where $p_0=\mathbb{E}[1/K_i]$ and $D_{\mathrm{KL}}$ is the Bernoulli KL divergence. For splicing at index $j$, the joint success probability is further bounded by $2^{-b}\cdot \exp\!\big(- (m-j+1)\,D_{\mathrm{KL}}(\tau\parallel p_0)\big)$.

\subsection{Complexity and Implementation}
Normalization/segmentation are $O(n)$ in tokens, chaining $O(m)$ hash calls, rendering $O(m)$ constant-time edits, and verification $O(nw)$ with narrow window $w\ll n$. Typical parameters are $L{=}32$, $b{=}10$, $\tau\in[0.6,0.8]$, $w{=}8$, $r\in\{1,2\}$, and $m\approx 8$–$12$. Embedding and verification steps are summarized in Algorithm~\ref{alg:prompthash}.

\begin{algorithm}[t]
\caption{PromptHash: Embed \& Verify}
\label{alg:prompthash}
\begin{algorithmic}[1]
\Require Instruction $x \in \Sigma^\ast$, metadata $M=(\textsf{role},\textsf{nonce},\textsf{ts})$, key $k$, tokenizer $T$, normalization $g$, codebooks $\{\mathcal{C}_i\}$, window $w$, threshold $\tau$, bits $b$
\State \textbf{Embed:}
\State $\tilde{x} \gets g(x)$; $\tilde{X} \gets T(\tilde{x})$; segment into $\{S_i\}_{i=1}^{m}$ per~(Eq.~\ref{eq:seg}); $h_{0} \gets H_{k}(\textsf{dom}\concat M)$
\For{$i=1$ \textbf{to} $m$}
  \State $h_i \gets H_{k}\!\big(S_i \concat i \concat M \concat h_{i-1}\big)$; \quad $t_i \gets \operatorname{Trunc}_{b}(h_i)$
  \State $K_i \gets |\mathcal{C}_i|$; \quad $j_i \gets 1 + \big(\operatorname{bin}(t_i) \bmod K_i\big)$
  \State $c_i^\star \gets \mathcal{C}_i[j_i]$; \quad $S_i' \gets \operatorname{Render}(S_i, c_i^\star)$
\EndFor
\State $x' \gets \operatorname{Merge}(\{S_i'\})$ \Comment{deliver to LLM}
\State \textbf{Verify:}
\State $\tilde{y} \gets g(y)$; segment $\{\hat{S}_i\}$; align $\{\hat{S}_i\}\leftrightarrow\{S_i\}$ with window $w$;
\State $S \gets 0$; \quad $h_{0} \gets H_{k}(\textsf{dom}\concat M)$; \quad $\textsf{ok} \gets \textsf{true}$
\For{$i=1$ \textbf{to} $m$}
  \State $h_i \gets H_{k}\!\big(S_i \concat i \concat M \concat h_{i-1}\big)$; \quad $t_i \gets \operatorname{Trunc}_{b}(h_i)$
  \State $j_i \gets 1 + \big(\operatorname{bin}(t_i) \bmod |\mathcal{C}_i|\big)$
  \State $c_i^\star \gets \mathcal{C}_i[j_i]$; \quad $Z_i \gets \Indicator{\operatorname{Renderable}(\hat{S}_i, c_i^\star)}$; \quad $S \gets S + Z_i$
  \If{alignment slip exceeds budget $r$}
    \State $\textsf{ok} \gets \textsf{false}$
  \EndIf
\EndFor
\State \Return $(S/m \ge \tau) \wedge \textsf{ok}$
\end{algorithmic}
\end{algorithm}



\section{Experimental Results and Analysis}
\label{sec:experiments}

\subsection{Experimental Settings}
\textbf{Hardware/Software.} All experiments were implemented in PyTorch~2.3 with CUDA~12.2 on a Linux server (Ubuntu~22.04) equipped with 2$\times$AMD EPYC 7742 CPUs and 8$\times$NVIDIA A100 (80\,GB). Unless otherwise specified, each configuration is repeated $5$ runs with different seeds; we report mean~$\pm$~std and 95\% CIs via Student-$t$.

\textbf{Benchmarks.} We evaluate three complementary settings that stress provenance under paraphrase, splicing, and indirect injection. (i) The \emph{Paraphrase Attack Corpus (PAC)} consists of 50k single-turn instructions sampled from HH, StackOverflow, and Alpaca, with four paraphrase intensities---light (synonym), medium (rephrase), heavy (structural), and aggressive (structural+voice)---each original paired with three paraphrase variants~\cite{pegasus2020,alpaca2023}. (ii) The \emph{Splice-and-Reflow Benchmark (SRB)} includes 10k multi-turn chats from ShareGPT and UltraChat, where adversaries relocate segments across sessions, reflow lists and headers, and append adversarial suffixes; splice position $j$ is uniformly sampled~\cite{sharegpt2023}. (iii) The \emph{Indirect Injection Testbed (IIT)} contains 5k untrusted contexts in HTML, Markdown, and PDF that embed adversarial instructions, simulating indirect prompt injection attacks; the LLM must ingest the context and verifiers assess whether embedded instructions are genuine with respect to the claimed session~\cite{greshake2023not,yi2024jailbreaksurvey}.


\textbf{Metrics.} We report True Accept Rate (TAR, genuine accepted), False Accept Rate (FAR, forgeries/splices accepted), Robust Accept Rate (RAR, benign paraphrase accepted), Area Under ROC (AUC), Equal Error Rate (EER), and overhead: token inflation $\Delta_{\text{tok}}$ and CPU latency per 1k tokens. Threshold $\tau$ is tuned on a held-out split (5\% benign paraphrase). Unless noted, default hyperparameters are: sentence-aware segmentation ($m\!=\!10\pm2$), tag bits $b\!=\!10$, codebook size $K\!\in\!\{8,8,\dots\}$, alignment window $w\!=\!8$, slip budget $r\!=\!2$.

\subsection{Overall Comparison}
We compare PromptHash to DetectGPT~\cite{mitchell2023detectgpt}, a perplexity-variance detector~\cite{weberwulff2023detectors}, and output-side watermarks (GreenList~\cite{kirchenbauer2023watermark}, Robust-WM~\cite{kuditipudi2023robust}, SynthID-Text~\cite{dathathri2024synthid}). For fairness, all baselines are evaluated on identical inputs and logs; output-side schemes are adapted to instruction attribution when possible.


\begin{table}[t]
\centering
\caption{Overall verification under paraphrase and splice threats (mean$\pm$std over 5 runs). 
RAR: benign paraphrase acceptance. Latency: CPU pre/post-processing per 1k tokens.}
\label{tab:overall}
\footnotesize
\setlength{\tabcolsep}{3pt} % 控制列间距
\begin{tabular}{lccccc}
\toprule
Method & TAR$\uparrow$ & FAR$\downarrow$ & RAR$\uparrow$ & AUC$\uparrow$ & Lat.(ms)$\downarrow$ \\
\midrule
DetectGPT~\cite{mitchell2023detectgpt} & $90.8{\pm}0.9$ & $7.5{\pm}0.4$ & $63.2{\pm}1.8$ & $0.931$ & $0$ \\
PPL-Var~\cite{weberwulff2023detectors} & $89.9{\pm}1.2$ & $6.0{\pm}0.6$ & $71.1{\pm}1.4$ & $0.924$ & $0$ \\
GreenList~\cite{kirchenbauer2023watermark} & $94.6{\pm}0.8$ & $3.3{\pm}0.3$ & $76.0{\pm}1.7$ & $0.962$ & $3.9$ \\
Robust-WM~\cite{kuditipudi2023robust} & $95.4{\pm}0.7$ & $2.7{\pm}0.2$ & $78.1{\pm}1.1$ & $0.969$ & $4.6$ \\
SynthID-Text~\cite{dathathri2024synthid} & $96.1{\pm}0.6$ & $2.4{\pm}0.2$ & $80.3{\pm}1.2$ & $0.974$ & $4.1$ \\
\textbf{PromptHash} & $\mathbf{98.3}{\pm}\mathbf{0.4}$ & $\mathbf{0.8}{\pm}\mathbf{0.2}$ & $\mathbf{96.6}{\pm}\mathbf{0.6}$ & $\mathbf{0.993}$ & $\mathbf{0.9}$ \\
\bottomrule
\end{tabular}
\end{table}


As shown in Table~\ref{tab:overall}, PromptHash achieves the highest TAR ($>\!98\%$) while keeping FAR below $1\%$, clearly outperforming both detectors and output-side watermarking schemes. Unlike DetectGPT and PPL-Var, which collapse under paraphrase (RAR below $72\%$), PromptHash preserves robustness with RAR~$>\!96\%$. Compared to output-side watermarks, our approach introduces an order-of-magnitude lower latency (0.9\,ms vs.\ $>4$\,ms per 1k tokens) while directly addressing instruction attribution. These results highlight PromptHash as the most effective and efficient solution for session-level provenance.



\subsection{Robustness to Paraphrase and Splicing}
\label{subsec:robustness}

We assess robustness under (i) paraphrase intensity on PAC and (ii) splice position $j$ on SRB. Equal Error Rate (EER) is also reported for operating-point invariance. Results in Table~\ref{tab:paraphrase} show that as paraphrase grows from light to aggressive, PromptHash maintains high TAR ($>\!97\%$) and keeps RAR nearly equal to TAR, indicating benign rewrites are rarely rejected. Token overhead $\Delta_{\text{tok}}$ remains below $0.4\%$. 

\begin{table}[t]
\centering
\caption{Robustness vs paraphrase intensity (PAC). $\Delta_{\text{tok}}$ denotes token inflation.}
\label{tab:paraphrase}
\footnotesize
\begin{tabular}{lccccc}
\toprule
Intensity & TAR & FAR & RAR & EER & $\Delta_{\text{tok}}$ \\
\midrule
Light       & $98.9{\pm}0.2$ & $0.7{\pm}0.1$ & $98.8{\pm}0.3$ & $0.9\%$ & $+0.24\%$ \\
Medium      & $98.6{\pm}0.3$ & $0.8{\pm}0.2$ & $98.2{\pm}0.4$ & $1.1\%$ & $+0.28\%$ \\
Heavy       & $98.1{\pm}0.4$ & $0.9{\pm}0.2$ & $97.3{\pm}0.6$ & $1.4\%$ & $+0.31\%$ \\
Aggressive  & $97.6{\pm}0.5$ & $1.1{\pm}0.3$ & $96.1{\pm}0.6$ & $1.8\%$ & $+0.34\%$ \\
\bottomrule
\end{tabular}
\end{table}

On SRB, Table~\ref{tab:splice} confirms that splice attacks are effectively contained. Early splices are strongly suppressed by the hash chain, while FAR increases only slightly with later insertions ($<\!1.2\%$). Chain failure rates remain above $98\%$, validating integrity under cross-session manipulation. 

\begin{table}[t]
\centering
\caption{Splice robustness on SRB by splice index $j$ (earlier $\rightarrow$ harder).}
\label{tab:splice}
\footnotesize
\setlength{\tabcolsep}{3pt} % 控制列间距
\begin{tabular}{lccccc}
\toprule
$j$ & 1--2 & 3--4 & 5--6 & 7--8 & 9--10 \\
\midrule
FAR (\%) & $0.6{\pm}0.2$ & $0.7{\pm}0.2$ & $0.9{\pm}0.2$ & $1.0{\pm}0.3$ & $1.2{\pm}0.3$ \\
Chain fail (\%) & $99.3$ & $99.1$ & $98.9$ & $98.6$ & $98.4$ \\
\bottomrule
\end{tabular}
\end{table}


PromptHash demonstrates paraphrase tolerance and splice integrity (FAR$<1.2\%$), providing resilience against both paraphrase-based obfuscation and cross-context splicing with negligible overhead.



\subsection{Ablations and Design Trade-offs}
We ablate key design factors of PromptHash, including (i) the position-sensitive chain, (ii) codebook size $K$ and tag length $b$, and (iii) alignment window $w$, slip budget $r$, and threshold $\tau$. For each ablation, other parameters remain at default. Results are summarized in Table~\ref{tab:ablation_summary}.

\begin{table}[t]
\centering
\caption{Ablation and sensitivity results on PAC (heavy paraphrase) and SRB (mixed). 
TAR = True Accept Rate, FAR = False Accept Rate, RAR = Robust Accept Rate, 
$\Delta_{\text{tok}}$ = token inflation, Lat. = CPU latency per 1k tokens.}
\label{tab:ablation_summary}
\footnotesize
\setlength{\tabcolsep}{2pt} % 缩小列间距
\begin{tabular}{lccccc}
\toprule
Variant & TAR & FAR & RAR & $\Delta_{tok}$ & Lat. \\
\midrule
w/o Chain & $96.9{\pm}0.6$ & $7.1{\pm}0.5$ & $95.8{\pm}0.7$ & $+0.29\%$ & $0.9$ \\
w/o Codebook & $97.2{\pm}0.5$ & $1.2{\pm}0.2$ & $96.0{\pm}0.6$ & $+2.7\%$ & $0.9$ \\
w/o Fuzzy Verif. & $94.1{\pm}0.7$ & $1.0{\pm}0.2$ & $82.6{\pm}1.0$ & $+0.28\%$ & $0.9$ \\
$K{=}4,\,b{=}8$ & $97.5$ & $1.6$ & $96.8$ & $+0.3\%$ & $0.7$ \\
$K{=}8,\,b{=}10$ & $\mathbf{98.1}$ & $\mathbf{0.9}$ & $\mathbf{97.3}$ & $+0.3\%$ & $0.9$ \\
$K{=}16,\,b{=}12$ & $98.2$ & $0.7$ & $97.5$ & $+0.3\%$ & $1.1$ \\
$(w,r,\tau)=(4,1,0.60)$ & $96.7$ & $0.9$ & $95.9$ & $+0.3\%$ & $0.9$ \\
$(8,2,0.70)$ & $\mathbf{97.6}$ & $\mathbf{1.1}$ & $\mathbf{96.9}$ & $+0.3\%$ & $0.9$ \\
$(12,3,0.75)$ & $97.8$ & $1.5$ & $97.0$ & $+0.3\%$ & $1.0$ \\
\bottomrule
\end{tabular}
\end{table}


Removing the chain catastrophically increases FAR, showing its necessity for splice integrity. Eliminating codebooks forces random padding, raising $\Delta_{\text{tok}}$ by almost an order of magnitude. Fuzzy verification is crucial for paraphrase tolerance (RAR drops $>10\%$ without it). Larger $K$ and $b$ improve robustness at marginal cost, while moderate alignment $(w{=}8,r{=}2,\tau{=}0.70)$ offers the best balance of tolerance and specificity.

\subsection{Overhead and Calibration}
We evaluate runtime efficiency and score calibration of PromptHash. 
Across all datasets, token overhead $\Delta_{\text{tok}}$ remains within $0.2$–$0.4\%$, showing that semantics-preserving rendering introduces only marginal inflation. CPU pre-/post-processing latency is $0.9{\pm}0.1$\,ms per 1k tokens, while GPU overhead is negligible since all operations run on CPU. 
Breaking down costs, normalization and segmentation dominate with roughly 40\% of total latency, while alignment and verification contribute less than 10\%. These results confirm that PromptHash can be deployed in latency-sensitive settings with virtually no runtime burden. 

Calibration analysis further demonstrates reliable verification. Empirical acceptance rates closely match predicted probabilities, producing near-diagonal calibration curves with a low Brier score of $0.031$. 
This indicates that verification scores are well-calibrated, which is critical for threshold tuning in high-stakes forensic applications. 

\subsection{Summary of Findings}
Across three complementary benchmarks (PAC, SRB, IIT), PromptHash consistently achieves high TAR ($>\!98\%$) and low FAR ($<\!1\%$) while maintaining benign acceptance under paraphrase (RAR $>\!96\%$). Robustness analysis confirms that PromptHash tolerates paraphrase intensity and resists splice attacks with negligible overhead ($0.2$–$0.4\%$ token inflation; $0.9$\,ms latency per 1k tokens).  Ablation studies highlight that the hash chain is indispensable for splice integrity, while fuzzy verification is key for paraphrase tolerance. Calibration analysis further shows near-perfect alignment between predicted and empirical acceptance rates. In summary, PromptHash establishes a reliable and efficient instruction-side watermarking primitive for LLM session forensics, balancing robustness, efficiency, and usability.


\section{Conclusion}
\label{sec:conclusion}
We addressed the central forensic challenge of cross-context splicing with paraphrasing by proposing \emph{PromptHash}, a self-authenticating, instruction-side watermark that cryptographically binds normalized instruction segments to session metadata and renders keyed tags via compact, semantics-preserving codebooks. A position-sensitive hash chain enforces splice/edit integrity, while a fuzzy verifier maintains acceptance under benign paraphrase with explicit error bounds. Implemented as a lightweight pre-/post-processor, PromptHash achieves high true-accept rates with sub–1\% false accepts and ${<}0.4\%$ token inflation at sub-millisecond CPU latency, outperforming content/behavior detectors and avoiding the applicability gap of output-side watermarks for instruction attribution. These results establish instruction-side watermarking as a practical primitive for accountable, auditable LLM deployments, with promising extensions to learned multilingual codebooks, tighter theoretical bounds, and end-to-end provenance integration in future work.





\bibliography{iclr2026_conference}
\bibliographystyle{iclr2026_conference}



\end{document}
