Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: large language models, state space models, neuromorphic hardware, quantization, sparsity
TL;DR: We implement a MatMul-free LLM on neuromorphic hardware and show constant scaling with sequence length and >2x throughput with >2x less energy per token.
Abstract: Large language models (LLMs) deliver impressive performance but require large amounts of energy. In this work, we present a MatMul-free LLM architecture adapted for Intel’s neuromorphic processor, Loihi 2. Our approach leverages Loihi 2’s support for low-precision, event-driven computation and stateful processing. Our hardware-aware quantization scheme, validated on GPU, demonstrates that a 370M-parameter MatMul-free model can be quantized with no accuracy loss. Based on preliminary results, we report up to 3× higher throughput with 2× less energy, compared to transformer-based LLMs on an edge GPU, with significantly better scaling. Further hardware optimizations will further increase throughput and decrease energy consumption. These results show the potential of neuromorphic hardware for efficient inference and pave the way for efficient reasoning models capable of generating complex, long-form text rapidly and cost-effectively.
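As an illustration of the core idea behind a MatMul-free layer (a minimal sketch, not the authors' actual kernel or the Loihi 2 implementation): with weights quantized to ternary values {-1, 0, +1}, the dense matrix multiply reduces to signed accumulation, i.e., additions and subtractions only. The function name `ternary_linear` and the NumPy formulation are illustrative assumptions.

```python
import numpy as np

def ternary_linear(x, w_ternary):
    """Compute x @ w_ternary using only additions/subtractions.

    x:          (batch, in_features) activations
    w_ternary:  (in_features, out_features) weights in {-1, 0, +1}

    Because every weight is -1, 0, or +1, each output column is just a
    signed sum of input columns -- no multiplications are needed.
    """
    out = np.zeros((x.shape[0], w_ternary.shape[1]), dtype=x.dtype)
    for j in range(w_ternary.shape[1]):
        pos = w_ternary[:, j] == 1   # columns added
        neg = w_ternary[:, j] == -1  # columns subtracted
        out[:, j] = x[:, pos].sum(axis=1) - x[:, neg].sum(axis=1)
    return out
```

In a real deployment the loop would be fused into event-driven, low-precision hardware operations; this sketch only shows why ternary quantization eliminates the multiply in the matrix product.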
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 68