Keywords: on-device LLM, evaluation, inference
TL;DR: This paper systematically evaluates performance and resource utilization of LLMs on resource-constrained edge devices.
Abstract: The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology---encompassing model capability, development efficiency, and system resources---for evaluating on-device LLMs. Our comprehensive evaluation, covering models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights:
1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW).
2) A practical threshold exists around $\sim$3.5 effective BPW, above which larger models subjected to low-bit quantization consistently outperform smaller models using higher bit-precision.
3) As model size decreases, the primary performance bottleneck may shift from computation to communication.
4) Power consumption on CPU is determined by low-level implementation specifics: computation-intensive operations consume more power than memory-intensive ones.
These insights offer practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://anonymous.4open.science/r/LLMOnDevice/.
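For readers unfamiliar with the metric, effective BPW counts not only the nominal weight bits but also the per-block quantization metadata (scales, zero-points). The minimal Python sketch below shows one way such a figure can be computed; the block layouts are hypothetical examples for illustration, not configurations reported in the paper.

```python
# Illustrative sketch (not from the paper): computing "effective bits-per-weight"
# (BPW) for block-quantization formats, where per-block metadata such as scales
# and zero-points adds overhead on top of the nominal weight bit-width.

def effective_bpw(weight_bits: int, block_size: int, metadata_bits: int) -> float:
    """Effective BPW = (nominal weight bits + per-block metadata bits) / weights per block."""
    return (block_size * weight_bits + metadata_bits) / block_size

# Hypothetical block layouts, loosely modeled on llama.cpp-style formats:
# 4-bit weights, 32-weight blocks, one fp16 scale per block -> 4.5 effective BPW
print(effective_bpw(weight_bits=4, block_size=32, metadata_bits=16))  # 4.5
# 2-bit weights, 16-weight blocks, fp16 scale + fp16 offset per block -> 4.0 effective BPW
print(effective_bpw(weight_bits=2, block_size=16, metadata_bits=32))  # 4.0
```

Under such layouts, a nominally 2-bit format can still exceed 4 effective BPW once metadata is included, which is presumably why findings such as the $\sim$3.5 threshold are stated in effective rather than nominal bits.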
Primary Area: datasets and benchmarks
Submission Number: 15144