EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption

Leo de Castro; Daniel Escudero; Adya Agrawal; Antigoni Polychroniadou; Manuela Veloso

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. CC BY 4.0

TL;DR: We develop a GPU-accelerated implementation of FHE and use this implementation to evaluate GPT-2.

Abstract: As large language models (LLMs) become more powerful, the computation required to run these models is increasingly outsourced to a third-party cloud. While this saves clients' computation, it risks leaking the clients' LLM queries to the cloud provider. Fully homomorphic encryption (FHE) presents a natural solution to this problem: simply encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can only be learned by the client who holds the secret key. In this work, we present a GPU-accelerated implementation of FHE and use this implementation to benchmark an encrypted GPT-2 forward pass, with runtimes over $200\times$ faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance.
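To illustrate why activation functions need approximating at all: FHE schemes natively evaluate additions and multiplications, so a non-polynomial activation such as GELU (the activation used in GPT-2) is typically replaced by a polynomial over a bounded input range. The sketch below is a generic illustration and not the approximation method of this work; the interval [-8, 8], the degrees 4 and 24, and the least-squares Chebyshev fit are all assumptions chosen for the example. It runs in plaintext with NumPy and only measures how the approximation error changes with the degree.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def gelu(x):
    """tanh form of GELU, the variant used in GPT-2."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def fit_polynomial(f, lo, hi, degree, samples=2001):
    """Least-squares Chebyshev fit of f on [lo, hi] (illustrative choice of method)."""
    xs = np.linspace(lo, hi, samples)
    # The fitted series is evaluated with additions and multiplications only,
    # which is the kind of arithmetic an FHE scheme can carry out on ciphertexts.
    return C.Chebyshev.fit(xs, f(xs), degree, domain=[lo, hi])


def max_error(f, poly, lo, hi, samples=10001):
    """Largest absolute deviation between f and its polynomial on a dense grid."""
    xs = np.linspace(lo, hi, samples)
    return float(np.max(np.abs(f(xs) - poly(xs))))


# Hypothetical parameters for this example only.
LO, HI = -8.0, 8.0
poly_low = fit_polynomial(gelu, LO, HI, degree=4)
poly_high = fit_polynomial(gelu, LO, HI, degree=24)

err_low = max_error(gelu, poly_low, LO, HI)
err_high = max_error(gelu, poly_high, LO, HI)

print(f"degree  4: max abs error = {err_low:.4f}")
print(f"degree 24: max abs error = {err_high:.4f}")
```

A higher degree lowers the error but costs more homomorphic multiplications, and inputs falling outside the fitted interval are not covered by the error measurement at all. This is the accuracy-versus-performance trade-off that the abstract's experimental analysis addresses.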

Lay Summary: Large language models (LLMs) are typically deployed in cloud environments. To use these models, the user's data must be sent to an external cloud machine. For sensitive queries (e.g., topics related to healthcare or finance), this represents a major security concern. This work improves the efficiency of techniques to privately evaluate models over sensitive queries. This allows users to safely send their query to a cloud machine and receive the model output without allowing the cloud to learn anything about their data. The main underlying tool is an advanced cryptographic primitive called fully homomorphic encryption (FHE), and a technical contribution of this work is a new GPU-accelerated implementation of FHE. We also develop methods to evaluate LLMs using FHE while preserving the quality of the model outputs.

Primary Area: Social Aspects->Privacy

Keywords: Fully homomorphic encryption, encrypted inference, approximate activation functions

Submission Number: 12971
