A 28nm 3.14 TFLOP/W BF16 LLM Fine-Tuning Processor with Asymmetric Quantization Computing for AI PC

Published: 2025, Last Modified: 29 Jan 2026, CICC 2025, CC BY-SA 4.0
Abstract: The powerful capabilities of large language models (LLMs) enable them to function as personal digital assistants. To preserve user privacy, personalized fine-tuning can be conducted locally on memory-constrained AI PCs using Parameter-Efficient Fine-Tuning (PEFT) algorithms such as QLoRA [1] and QA-LoRA [2]. Figure 1 illustrates the computation flow of QLoRA: BF16 input activations undergo matrix multiplication with 4-bit quantized pre-trained weights and with BF16 adapter weights. In this flow, asymmetric quantization MACs are the primary bottleneck, accounting for approximately 97% of the computational load. However, current neural processing units (NPUs) offer limited support for asymmetric computation; for example, fine-tuning Llama2-13B on an RTX 3090 takes over 25 hours. This highlights the need for fine-tuning processors optimized for asymmetric quantization. Yet asymmetric quantization presents hardware design challenges: 1) existing NPUs primarily support symmetric formats, which introduces conversion overhead and inefficiencies; 2) current NPUs lack efficient support for low-precision data transposition; and 3) 4-bit quantized QLoRA still incurs high external memory access and storage demands, while the use of 2:4 sparsity in low-bit LLM fine-tuning [3] incurs substantial bitmask overhead with limited benefits.
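To make the term "asymmetric quantization MAC" concrete, the following is a minimal sketch of the arithmetic for one weight group and one output element. It is not the processor's datapath. It assumes a uniform 4-bit format with a per-group scale and a floating-point offset (w ≈ scale · q + offset); the abstract does not specify the exact format. Plain Python floats stand in for BF16, and all function names are hypothetical.

```python
def quantize_asym(group, bits=4):
    """Quantize one weight group to unsigned integers (assumed uniform format).

    Returns (q, scale, offset) such that w ~ scale * q + offset.
    """
    qmax = (1 << bits) - 1                      # 15 for 4-bit
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [min(qmax, max(0, round((w - lo) / scale))) for w in group]
    return q, scale, lo


def dequantize(q, scale, offset):
    """Recover approximate weights from the quantized group."""
    return [scale * qi + offset for qi in q]


def asym_dot(x, q, scale, offset):
    """Dot product of activations with an asymmetrically quantized group.

    sum_i x_i * (scale*q_i + offset) = scale * sum_i(x_i*q_i) + offset * sum_i(x_i)
    The second term is the extra work that a symmetric format (offset = 0) avoids.
    """
    acc = sum(xi * qi for xi, qi in zip(x, q))  # MAC with 4-bit integers
    return scale * acc + offset * sum(x)


def qlora_output(x, q, scale, offset, lora_a, lora_b):
    """One output element: frozen quantized weights plus a rank-r adapter.

    lora_a: r vectors of len(x); lora_b: r scalars for this output element.
    """
    base = asym_dot(x, q, scale, offset)
    adapter = sum(b * sum(xi * ai for xi, ai in zip(x, a_row))
                  for a_row, b in zip(lora_a, lora_b))
    return base + adapter
```

In this sketch the offset term is factored out of the inner loop, so the multiply-accumulate itself runs on the 4-bit integers, and the offset correction costs one activation sum per group.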