Activation Swapping for Feed Forward Networks in Batched LLM Inference on NPU

Published: 2025, Last Modified: 07 Nov 2025. AICAS 2025. License: CC BY-SA 4.0
Abstract: Accelerating large language model (LLM) inference on modern hardware systems such as neural processing units (NPUs) is essential. While increasing the batch size potentially increases throughput for feed-forward networks (FFNs) in batched LLM inference, the batch size may be limited by the size of the activation scratchpad on the NPU. In addition, increasing the batch size may not improve throughput, as the gain depends on the model and NPU configuration. This work introduces a new concept of an effective batch size for batched LLM inference on NPUs. This new indicator represents the maximum batch size, for a given model and hardware, at which an increase in batch size still yields a throughput gain. Under the effective batch size, we propose activation swapping - a scheduling method that swaps FFN activations in and out between the scratchpad and DRAM, increasing the batch size with minimal external memory access overhead. Experimental results demonstrate that the proposed method can increase the batch size by 33.3% and achieve up to 1.27x higher throughput on the GPT-2 2.5B model with a given hardware configuration.
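The notion of an effective batch size can be illustrated with a simple roofline-style model: FFN latency per batch is roughly the maximum of compute time and weight-load time, so once compute dominates, growing the batch no longer raises throughput. The sketch below is a minimal illustration of this idea only; all numbers (model dimensions, peak FLOPs, DRAM bandwidth) and function names are hypothetical assumptions, not figures or code from the paper.

```python
# Toy roofline model of FFN throughput vs. batch size.
# Assumed, illustrative parameters (NOT from the paper):
#   d_model, d_ff   - FFN dimensions
#   peak_flops      - NPU peak compute (FLOP/s)
#   dram_bw         - DRAM bandwidth (bytes/s)
#   bytes_per_param - weight precision (e.g. 2 for FP16)

def ffn_time(batch, d_model=4096, d_ff=16384,
             peak_flops=50e12, dram_bw=100e9, bytes_per_param=2):
    # Two matmuls (up- and down-projection): 2 * 2 * B * d_model * d_ff FLOPs.
    flops = 2 * 2 * batch * d_model * d_ff
    # Weights are streamed from DRAM once per batch.
    weight_bytes = 2 * d_model * d_ff * bytes_per_param
    return max(flops / peak_flops, weight_bytes / dram_bw)

def throughput(batch):
    return batch / ffn_time(batch)

def effective_batch_size(max_batch=1024):
    # Largest batch size at which adding one more sample still
    # measurably improves throughput (small tolerance for float noise).
    b = 1
    while b < max_batch and throughput(b + 1) > throughput(b) * 1.0001:
        b += 1
    return b
```

With these illustrative parameters the FFN is memory-bound up to a batch size of 500, beyond which throughput flattens out; that crossover point is the effective batch size in this toy model.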