Breakthrough low-latency, high-energy-efficiency LLM inference performance using NorthPole
Abstract: For a 3-billion-parameter LLM, a research prototype inference appliance with 16 IBM AIU NorthPole processors
delivers a massive 28,356 tokens / second of system throughput
and sub-1 ms / token (per-user) latency while consuming merely
672 W for 16 NorthPole cards in a compact 2U form factor.
With a focus on low latency and high energy efficiency,
when NorthPole (in 12 nm) is compared to a suite of GPUs
(in 7 / 5 / 4 nm) at various power consumptions, at the lowest
GPU latency, NorthPole provides a 72.7× better energy metric
(tokens / second / W) while providing better latency.
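
The energy metric named above (tokens / second / W) can be worked out directly from the appliance figures quoted in the abstract. The following is a minimal sketch, not the authors' measurement code: the function name `energy_metrics` and the dictionary keys are hypothetical, and it assumes the 672 W figure covers only the 16 NorthPole cards (as the abstract states), not the host system. It does not reproduce the 72.7× GPU comparison, since the abstract gives no GPU figures.

```python
# Figures quoted in the abstract (3-billion-parameter LLM, 16-card appliance).
SYSTEM_THROUGHPUT_TOK_PER_S = 28_356  # system throughput, tokens / second
CARD_POWER_W = 672                    # power for the 16 NorthPole cards only
NUM_CARDS = 16


def energy_metrics(throughput_tok_per_s, power_w, num_cards):
    """Derive per-watt and per-card figures (hypothetical helper)."""
    return {
        # The abstract's energy metric: tokens / second / W.
        "tokens_per_s_per_w": throughput_tok_per_s / power_w,
        # Equivalent view: energy spent per generated token, in millijoules.
        "millijoules_per_token": 1000 * power_w / throughput_tok_per_s,
        # Simple per-card averages, assuming an even split across cards.
        "watts_per_card": power_w / num_cards,
        "tokens_per_s_per_card": throughput_tok_per_s / num_cards,
    }


metrics = energy_metrics(SYSTEM_THROUGHPUT_TOK_PER_S, CARD_POWER_W, NUM_CARDS)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the appliance works out to roughly 42.2 tokens / second / W, or about 42 W per card on average.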