Abstract: In recent years, the emergence of large language models (LLMs) has profoundly transformed production and daily life. These models have shown tremendous potential in fields such as natural language processing, speech recognition, and recommendation systems, and are increasingly playing crucial roles in applications such as human–computer interaction and intelligent customer service. Efficient inference solutions for LLMs in data centers have been extensively researched, with a focus on meeting users' quality-of-service (QoS) requirements. In this article, we focus on two additional requirements that responsible LLM inference should satisfy under QoS constraints: security throughout model execution and low maintenance requirements for the inference system. We therefore propose LLMaaS, a trusted model inference platform built on a serverless computing platform and aimed at providing inference as a service for LLMs. First, we design a trusted serverless computing platform based on Intel Software Guard Extensions (SGX), which includes distributed identity verification and SGX device plugins to ensure the security and trustworthiness of the inference process. In addition, to reduce the system's maintenance requirements, we enhance the SGX-based deep learning computing framework, including replacing PyTorch and using a greedy algorithm for graph partitioning. We evaluate four typical large models, and the experimental results demonstrate that, with minimal overhead and minimal user code modifications, we can ensure the security of model execution.