Proof-of-Concept for Private Local-to-Cloud LLM Chat via Trusted Execution Environments

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: security, performance, language modeling
TL;DR: Exploring performance aspects of an end-to-end secure chat with an LLM server on a remote confidential GPU
Abstract: Cloud-based LLM assistants pass every prompt through cloud servers in plaintext, leaving personal information open to inspection by cloud providers and any malicious actor with access to their servers. Current privacy techniques either degrade quality or are several orders of magnitude slower. In contrast, Trusted Execution Environments (TEEs) offer a practical, hardware-based path forward. We explore recent TEE-based virtual machines built on confidential NVIDIA H100 GPUs and AMD SEV-SNP CPUs. Naive PyTorch use inside such a TEE incurs a 1.87× slowdown due to encryption of CPU-GPU transfers. Moreover, open-source communication protocols between a local client and such a remote TEE are lacking. In response, we propose TEEChat, a research prototype that (1) binds a local client to a remote TEE hosting an LLM via attestation and key exchange, (2) secures communication with full end-to-end encryption, and (3) minimizes overhead with targeted kernel and I/O optimizations. For models over 30B parameters, TEEChat adds just 1% latency, showing that LLM inference inside TEEs is already practical.
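To make the abstract's three protocol steps concrete, the sketch below walks through a client-to-TEE handshake and one encrypted chat turn. It is illustrative only, not TEEChat's actual code: the attestation check is stubbed out (a real client would first verify an attestation report, e.g. from the AMD SEV-SNP or NVIDIA H100 confidential-computing stack, binding the server's public key to a genuine enclave), and the X25519/HKDF/AES-GCM choices via the `cryptography` package are assumptions.

```python
# Minimal sketch of the client<->TEE binding and end-to-end encrypted
# exchange described in the abstract. All names and library choices
# are assumptions for illustration, not TEEChat's implementation.
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF


def derive_session_key(private_key, peer_public_key):
    """Run X25519 ECDH and derive a 256-bit AES-GCM session key."""
    shared = private_key.exchange(peer_public_key)
    return HKDF(
        algorithm=hashes.SHA256(), length=32, salt=None,
        info=b"teechat-session",  # hypothetical protocol label
    ).derive(shared)


# --- Step 1: attestation + key exchange (TEE side simulated locally) ---
# A real client would verify an attestation report proving that
# tee_priv's public key was generated inside a genuine TEE; that
# verification is elided here.
client_priv = X25519PrivateKey.generate()
tee_priv = X25519PrivateKey.generate()  # stands in for the remote enclave

client_key = derive_session_key(client_priv, tee_priv.public_key())
tee_key = derive_session_key(tee_priv, client_priv.public_key())
assert client_key == tee_key  # both ends now share the session key

# --- Step 2: end-to-end encrypted prompt ---
nonce = os.urandom(12)  # never reuse a nonce under the same key
ciphertext = AESGCM(client_key).encrypt(
    nonce, b"summarize my medical records", None)

# Only code inside the enclave holds the key, so the cloud host sees
# ciphertext only; the enclave decrypts, runs LLM inference, and
# returns its reply encrypted over the same channel.
prompt = AESGCM(tee_key).decrypt(nonce, ciphertext, None)
print(prompt.decode())
```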
Submission Number: 139