Abstract: For a 3-billion-parameter LLM, a research prototype inference appliance with 16 IBM AIU NorthPole processors delivers 28,356 tokens/second of system throughput and sub-1 ms/token (per-user) latency, while the 16 NorthPole cards consume 672 W in a compact 2U form factor. The focus is on low latency and high energy efficiency. When NorthPole (in 12 nm) is compared to a suite of GPUs (in 7/5/4 nm) at various power consumptions, then at the lowest GPU latency NorthPole achieves a 72.7x better energy metric (tokens/second/W) while also providing lower latency.
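The headline figures above can be combined into the energy metric the abstract names. The following is a minimal sketch of that arithmetic only. It assumes the metric is system throughput divided by the quoted 672 W of card power, which the abstract does not state explicitly; the constant and function names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract.
SYSTEM_THROUGHPUT_TOK_PER_S = 28_356  # tokens/second for the whole appliance
CARD_POWER_W = 672                    # watts for the 16 NorthPole cards only
NUM_CARDS = 16


def energy_metric(tokens_per_second: float, watts: float) -> float:
    """Energy metric as named in the abstract: tokens/second/W."""
    return tokens_per_second / watts


# Assumption: the metric is computed against card power, not whole-server power.
tokens_per_s_per_w = energy_metric(SYSTEM_THROUGHPUT_TOK_PER_S, CARD_POWER_W)

# Simple averages across the 16 cards; not a claim about what one card
# delivers when running on its own.
power_per_card_w = CARD_POWER_W / NUM_CARDS
throughput_per_card = SYSTEM_THROUGHPUT_TOK_PER_S / NUM_CARDS

print(f"{tokens_per_s_per_w:.1f} tokens/second/W (card power only)")
print(f"{power_per_card_w:.0f} W per card on average")
print(f"{throughput_per_card:.2f} tokens/second per card on average")
```

Under these assumptions the appliance works out to roughly 42.2 tokens/second/W and an average of 42 W per card. The per-card throughput is only an average of the system figure, since the abstract reports throughput for the 16-card appliance as a whole.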
External IDs: dblp:conf/hpec/AppuswamyDTECAA24