Security of NVMe Offloaded Data in Large-Scale Machine Learning

Published: 24 Sept 2023, Last Modified: 06 Mar 2025. European Symposium on Research in Computer Security (ESORICS) 2023. License: CC BY-NC 4.0
Abstract: Large-scale machine learning (LSML) models, such as GPT-3.5, which powers the well-known ChatGPT chatbot, have revolutionized our perception of AI by enabling more natural, context-aware, and interactive experiences. Yet, training such large models requires multiple months of computation on expensive hardware, including GPUs, orchestrated by specialized software, so-called LSML frameworks. Due to the model size, neither the on-device memory of GPUs nor the system RAM can hold all parameters simultaneously during training. Therefore, LSML frameworks dynamically offload data to NVMe storage and reload it just in time. In this paper, we investigate the security of NVMe offloaded data in LSML against poisoning attacks and present NVMevade, the first untargeted poisoning attack on NVMe offloads. NVMevade allows the attacker to reduce model performance, as well as slow down or even stall the training process. For instance, we demonstrate that an attacker can achieve a stealthy 182% increase in training time, thus inflating the cost of model training. To address this vulnerability, we develop NVMensure, the first defense that guarantees the integrity and freshness of NVMe offloaded data in LSML. By conducting a large-scale study, we demonstrate the robustness of NVMensure against poisoning attacks and explore the runtime-efficiency and security trade-offs it can provide. We tested 22 different NVMensure configurations and report an overhead between 9.8% and 64.2%, depending on the selected security level. We also note that NVMensure will be effective against targeted poisoning attacks, which do not exist yet but might be developed in the future.