Breakthrough low-latency, high-energy-efficiency LLM inference performance using NorthPole

Alexander Andreopoulos

Published: 26 Sept 2024, Last Modified: 30 Sept 2024, Venue: IEEE HPEC, Readers: Everyone, License: CC BY-NC 4.0

Abstract: For a 3-billion-parameter LLM, a research prototype inference appliance with 16 IBM AIU NorthPole processors delivers a massive 28,356 tokens / second of system throughput and sub-1 ms / token (per-user) latency while consuming merely 672 W for 16 NorthPole cards in a compact 2U form factor. With a focus on low latency and high energy efficiency, when NorthPole (in 12 nm) is compared to a suite of GPUs (in 7 / 5 / 4 nm) at various power consumptions, at the lowest GPU latency, NorthPole provides a 72.7× better energy metric (tokens / second / W) while providing better latency.
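The headline figures in the abstract can be combined into a few derived quantities. The following is a minimal sketch, not the authors' methodology: the function name `derived_metrics` and the dictionary keys are hypothetical, and the calculation assumes the 672 W figure covers only the 16 NorthPole cards, as the abstract states. It does not reproduce the 72.7× comparison, which depends on GPU measurements not given here.

```python
# Figures quoted in the abstract.
THROUGHPUT_TOKENS_PER_S = 28_356  # system throughput, tokens / second
CARD_POWER_W = 672                # power for the 16 NorthPole cards, watts
NUM_CARDS = 16


def derived_metrics(throughput, power_w, cards):
    """Hypothetical helper: derive per-watt and per-card quantities."""
    return {
        "tokens_per_s_per_w": throughput / power_w,    # energy metric
        "joules_per_token": power_w / throughput,      # W = J/s, so J/token
        "watts_per_card": power_w / cards,
        "tokens_per_s_per_card": throughput / cards,
    }


metrics = derived_metrics(THROUGHPUT_TOKENS_PER_S, CARD_POWER_W, NUM_CARDS)
for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

Under these assumptions the appliance works out to roughly 42.2 tokens / second / W at the system level, about 23.7 mJ per token, and 42 W per card. These are simple averages over the whole appliance and may differ from the operating point used for the GPU comparison.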