Secure Autoregressive Inference with Prompt Separation via Key-Value Caching

ICLR 2026 Conference Submission 15350 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: privacy-preserving inference, large language model, autoregressive generative models
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance, driving their widespread adoption across various applications. This prevalence raises the importance of protecting the privacy of user requests during inference. While Fully Homomorphic Encryption (FHE) and Secure Multi-Party Computation (MPC) offer promising solutions for privacy-preserving inference, they incur significant latency overhead that limits practical deployment. Prior research has explored more efficient cryptographic primitives and polynomial approximations of non-linear operations, yet inference latency remains far higher than that of plaintext execution. To further mitigate this computational overhead, we introduce a novel approach that leverages prompt separation with key-value caching. Our method accelerates secure inference by processing non-sensitive tokens in plaintext and reusing their key-value caches when subsequently processing private tokens. To preserve effective contextual reasoning, we also introduce an attention-mask adjustment mechanism that constrains privacy-sensitive tokens to attend to nearby tokens from their original masked positions. Through experiments across various LLM architectures and MPC frameworks, we show that our approach achieves a 1.5-2.5$\times$ reduction in inference latency without significant performance degradation.
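As a rough illustration of the mechanism described in the abstract, the sketch below (our own, not the authors' implementation; the names `plaintext_prefill`, `private_attention`, and the `window` parameter are hypothetical) prefills a key-value cache from the non-sensitive prefix in the clear, then lets private tokens attend over that cache and themselves under a locality-restricted mask standing in for the paper's attention-mask adjustment. It uses a single attention head with no projections beyond Q/K/V for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def plaintext_prefill(prefix_h, Wk, Wv):
    """Run the non-sensitive prefix in plaintext and cache its keys/values."""
    return prefix_h @ Wk, prefix_h @ Wv          # (P, d), (P, d)

def private_attention(priv_h, kv_cache, Wq, Wk, Wv, window=16):
    """Attend private tokens over the cached plaintext prefix plus themselves.

    The mask lets each private token see only the last `window` prefix
    positions and the private tokens up to itself (causal); only this
    function would run under MPC in the setting the abstract describes.
    """
    Kp, Vp = kv_cache
    P, d = Kp.shape
    S = priv_h.shape[0]
    Q = priv_h @ Wq                              # (S, d)
    K = np.concatenate([Kp, priv_h @ Wk], axis=0)  # (P+S, d)
    V = np.concatenate([Vp, priv_h @ Wv], axis=0)
    mask = np.full((S, P + S), -np.inf)
    mask[:, max(0, P - window):P] = 0.0          # nearby plaintext context only
    for i in range(S):                           # causal over private tokens
        mask[i, P:P + i + 1] = 0.0
    att = softmax(Q @ K.T / np.sqrt(d) + mask)
    return att @ V

# Toy usage: 32 public prompt tokens prefilled in the clear, 4 private tokens.
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = plaintext_prefill(rng.standard_normal((32, d)), Wk, Wv)
out = private_attention(rng.standard_normal((4, d)), cache, Wq, Wk, Wv)
print(out.shape)   # (4, 8)
```

The intended saving is that the prefill over the public prompt never enters the cryptographic protocol; only the comparatively small private segment is processed securely, reusing the plaintext key-value cache as read-only context.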
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15350