Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Secret Sharing, Privacy, Security, Transformer, Secure multi-party computation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: To protect users' prompts from LLM servers, we design an efficient and accurate secure multi-party computation approach.
Abstract: With ChatGPT as a representative example, many companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the model provider. Previous works have studied secure inference for Transformer models using secure multi-party computation (MPC), where model parameters and clients' prompts are kept secret. Despite this, these frameworks are still limited in terms of model performance, efficiency, and deployment. To address these limitations, we propose PUMA, a framework for fast and secure Transformer model inference. Our framework designs high-quality approximations for expensive functions such as GeLU and softmax, which significantly reduce the cost of secure inference while preserving model performance. Additionally, we design secure Embedding and LayerNorm procedures that faithfully implement the desired functionality without undermining the Transformer architecture. PUMA is about 2× faster than the state-of-the-art framework MPCFORMER (ICLR 2023) and achieves accuracy similar to that of plaintext models without fine-tuning (which previous works failed to achieve). PUMA can even evaluate LLaMA-7B in around 5 minutes to generate one token. To the best of our knowledge, this is the first time a model of this size has been evaluated under MPC.
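As context for the GeLU claim in the abstract, below is a minimal plaintext sketch of the general technique it refers to: replacing GeLU with a low-degree piecewise polynomial on a bounded interval, so that only cheap comparisons and polynomial evaluation are needed under MPC. The breakpoints (-4, 3), the polynomial degree, and the NumPy fitting are illustrative assumptions, not PUMA's actual approximation; in a real MPC deployment the comparisons would yield secret-shared bits and the branches would be combined by oblivious selection rather than plaintext indexing.

```python
import numpy as np

# Reference (plaintext) tanh-based GeLU, used here only to fit the polynomial.
def gelu_ref(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Fit a degree-6 polynomial on a bounded interval. The interval endpoints
# and the degree are illustrative placeholders, not the paper's values.
_XS = np.linspace(-4.0, 3.0, 4096)
_COEFFS = np.polyfit(_XS, gelu_ref(_XS), deg=6)

def gelu_piecewise(x: np.ndarray) -> np.ndarray:
    """GeLU via a piecewise rule: ~0 on the far-left tail, the identity on
    the right tail, and a low-degree polynomial in between. Under MPC, the
    two comparisons and the branch selection run on secret shares."""
    left = x < -4.0          # GeLU(x) is essentially 0 here
    right = x > 3.0          # GeLU(x) is essentially x here
    mid = ~(left | right)    # polynomial region

    out = np.zeros_like(x, dtype=float)
    out[right] = x[right]
    out[mid] = np.polyval(_COEFFS, x[mid])
    return out

if __name__ == "__main__":
    xs = np.linspace(-6.0, 6.0, 121)
    print("max abs error:", np.max(np.abs(gelu_piecewise(xs) - gelu_ref(xs))))
```

The design intuition is that GeLU is only nonlinear on a narrow band around zero, so a single low-degree polynomial there, plus trivial tail rules, keeps the number of expensive secure multiplications and comparisons small.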
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4557