Keywords: trusted execution environments, cloud computing, model privacy, data privacy
Abstract: Recent advances in Transformer-based foundation models (FMs) have driven significant progress across diverse AI tasks, facilitating their deployment in security-sensitive domains. Despite their capabilities, FMs impose substantial inference costs, driving reliance on third-party cloud infrastructure equipped with high-performance compute resources. However, these cloud platforms cannot be fully trusted and remain vulnerable to data breaches, introducing dual confidentiality challenges: protecting user data from exposure and safeguarding models against unauthorized access. Mainstream protection mechanisms leverage trusted execution environments (TEEs), where confidentiality and integrity are enforced through hardware-based isolation, encryption, and integrity verification. Yet executing inference entirely within TEEs incurs significant overhead, which is further exacerbated in large-scale FMs. Recent studies have proposed schemes that combine TEEs with untrusted accelerators (e.g., GPUs) to offload partial inference operations. However, prior offloading schemes cannot solve the dual confidentiality challenges of FM inference, since operations such as ***Attention*** depend on dynamic, data-dependent operands that preclude secure precomputation and must therefore remain within TEEs. Moreover, the communication overhead between TEEs and accelerators grows dramatically with model scale, posing a new system-design challenge for FMs.
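To make the precomputation point concrete, the sketch below illustrates the classic mask-and-unblind offloading pattern for a linear layer with a static weight (Slalom-style blinding, which protects data but not the model), and why it breaks down for ***Attention***. This is a minimal sketch of the baseline technique the abstract alludes to, not Twinshield's protocol; the shapes, the mask `R`, and the weight `W` are illustrative assumptions.

```python
import numpy as np

# Hypothetical shapes for one linear layer.
d_in, d_out, batch = 768, 768, 4
W = np.random.randn(d_in, d_out)   # static model weight (known to the TEE in this baseline)
X = np.random.randn(batch, d_in)   # private activation held inside the TEE

# --- Offline, inside the TEE: precompute a blinding pair (R, R @ W). ---
# This works only because W is static and known ahead of time.
R = np.random.randn(batch, d_in)   # one-time random mask
RW = R @ W                         # precomputed unblinding term

# --- Online ---
X_blind = X + R                    # TEE: mask the private activation
Y_blind = X_blind @ W              # untrusted accelerator: heavy matmul on blinded input
Y = Y_blind - RW                   # TEE: cheap unblinding recovers the true output

assert np.allclose(Y, X @ W)

# Why this fails for Attention: in Q @ K.T, *both* operands are produced at
# runtime from the input, so no term analogous to RW can be precomputed,
# forcing the operation (in prior schemes) to stay inside the TEE.
```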
To address these challenges, we propose ***Twinshield***, a framework that enables secure inference of Transformer-based FMs in heterogeneous TEE–accelerator systems with dual protection for both model and data. ***Twinshield*** improves efficiency through ***protocol-level*** outsourcing, which securely offloads the majority of operations to accelerators, and enhances throughput via a ***system-level*** design that overlaps TEE preparation, communication, and accelerator execution. Our evaluation on representative LLMs and VLMs shows that ***Twinshield*** offloads about 87% of computations to accelerators and achieves $3.3\times$–$5.1\times$ speedups over baselines. The code is publicly available at https://anonymous.4open.science/r/Twinshield/README.md.
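The ***system-level*** overlap can be pictured as a three-stage software pipeline in which TEE preparation for one layer proceeds while earlier layers are in transit or executing on the accelerator. The following is a minimal sketch assuming a queue-based producer/consumer structure; the stage names, sleep times, and queue depths are illustrative assumptions, not Twinshield's implementation or measured costs.

```python
import queue
import threading
import time

def tee_prepare(num_layers, out_q):
    """Stand-in for mask generation / blinding inside the TEE."""
    for i in range(num_layers):
        time.sleep(0.01)              # placeholder for per-layer preparation cost
        out_q.put(("blinded_input", i))
    out_q.put(None)                   # sentinel: no more work

def transfer(in_q, out_q):
    """Stand-in for TEE <-> accelerator communication."""
    while (item := in_q.get()) is not None:
        time.sleep(0.005)             # placeholder for transfer latency
        out_q.put(item)
    out_q.put(None)                   # propagate the sentinel downstream

def gpu_compute(in_q):
    """Stand-in for the offloaded matmuls on the accelerator."""
    while (item := in_q.get()) is not None:
        time.sleep(0.02)              # placeholder for accelerator execution
        _, i = item                   # unblinding of layer i would happen back in the TEE

# Bounded queues give backpressure, so the three stages run concurrently
# without the TEE racing arbitrarily far ahead of the accelerator.
q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
threads = [
    threading.Thread(target=tee_prepare, args=(8, q1)),
    threading.Thread(target=transfer, args=(q1, q2)),
    threading.Thread(target=gpu_compute, args=(q2,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With such overlap, end-to-end latency approaches the cost of the slowest stage rather than the sum of all three, which is the throughput benefit the abstract attributes to the system-level design.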
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22926