PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM

Published: 01 Jan 2025, Last Modified: 16 May 2025, HPCA 2025, CC BY-SA 4.0
Abstract: Transformer-based Large Language Models (LLMs) demand significant computational and memory resources due to the autoregressive token generation in decoder blocks. In particular, the attention layer has low arithmetic intensity but high memory traffic, because the KV matrices must be read and updated at every decoder iteration. As a result, LLM inference becomes memory-bound, which increases latency. To address this, we introduce PAISE, a framework that leverages Processing-In-Memory (PIM) technology to offload memory-intensive tasks. PAISE uses heterogeneous GPU-PIM computing resources to optimize inference operations in transformer-based LLMs. The framework comprises (i) a scheduling algorithm that decides which operations to offload to PIM based on the model configuration and the PIM hardware specifications, and (ii) an enhanced PIM kernel that performs transaction-wise interleave-batched GEMM (General Matrix Multiplication) operations, maximizing data throughput via data layout adjustments. We implemented PAISE on the GPT-2 and Llama2-7B models using an AMD MI100 GPU with HBM-PIM devices. Our evaluations show that offloading the attention layer to PIM reduces execution time by up to 48.3% compared to GPU-only inference, demonstrating PAISE’s potential to improve the efficiency of LLM inference.
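To illustrate why a low-arithmetic-intensity operation can be a candidate for offloading, the following is a minimal sketch of a roofline-style placement rule. It is not PAISE's scheduling algorithm, which the abstract does not detail. The `Device` class, the function names, and all throughput and bandwidth figures are hypothetical values chosen only for illustration, and the model ignores caching, launch overhead, and data transfer costs.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device description (not taken from the paper)."""
    peak_flops: float      # peak compute throughput, FLOP/s
    mem_bandwidth: float   # usable memory bandwidth, bytes/s


def gemm_cost(m: int, n: int, k: int, bytes_per_elem: int = 2):
    """Return (flops, bytes moved) for an (m x k) @ (k x n) product.

    Assumes each operand and the result are touched exactly once.
    """
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops, traffic


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity in FLOPs per byte."""
    flops, traffic = gemm_cost(m, n, k, bytes_per_elem)
    return flops / traffic


def roofline_time(m: int, n: int, k: int, dev: Device) -> float:
    """Roofline estimate: time is bounded by compute or by memory traffic."""
    flops, traffic = gemm_cost(m, n, k)
    return max(flops / dev.peak_flops, traffic / dev.mem_bandwidth)


def choose_device(m: int, n: int, k: int, gpu: Device, pim: Device) -> str:
    """Pick whichever device has the lower estimated time."""
    return "pim" if roofline_time(m, n, k, pim) < roofline_time(m, n, k, gpu) else "gpu"


# Hypothetical figures: the GPU has far more compute, the PIM device has
# more effective bandwidth because it computes next to the memory banks.
GPU = Device(peak_flops=100e12, mem_bandwidth=1e12)
PIM = Device(peak_flops=4e12, mem_bandwidth=4e12)

# Decode-time attention scores: one query row against a cached K matrix
# (head dimension 128, 2048 cached tokens). Intensity is below 1 FLOP/byte.
attention_choice = choose_device(1, 2048, 128, GPU, PIM)

# A large batched projection GEMM has high intensity and stays on the GPU.
projection_choice = choose_device(512, 4096, 4096, GPU, PIM)
```

Under these assumed figures, the single-row attention product is bandwidth-limited on the GPU and is placed on the PIM device, while the large projection GEMM is compute-limited and remains on the GPU. A real scheduler would also need to account for the cost of moving data between the two devices.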