Harnessing Large Language Models Locally: Empirical Results and Implications

ICLR 2026 Conference Submission 15144 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: on-device LLM, evaluation, inference
TL;DR: This paper systematically evaluates performance and resource utilization of LLMs on resource-constrained edge devices.
Abstract: The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology---encompassing model capability, development efficiency, and system resources---for evaluating on-device LLMs. Our comprehensive evaluation, covering models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around $\sim$3.5 effective BPW, above which larger models subjected to low-bit quantization consistently outperform smaller models using higher bit-precision. 3) As model size decreases, the primary performance bottleneck potentially shifts from computation to communication. 4) On CPUs, power consumption is determined by low-level implementation specifics, with computation-intensive operations consuming more power than memory-intensive ones. These insights offer practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://anonymous.4open.science/r/LLMOnDevice/.
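
To make the effective-BPW bookkeeping behind insights 1) and 2) concrete, the following minimal Python sketch (not taken from the linked codebase) estimates effective bits-per-weight for a group-wise PTQ format and the resulting weight-memory footprint. The function names, group sizes, metadata overheads, and memory budget are illustrative assumptions; real formats differ in how scales and zero points are stored.

def effective_bpw(weight_bits: float, group_size: int,
                  scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Effective BPW = raw weight bits plus per-group metadata amortized over the group.

    Assumes one scale (and optionally one zero point) per quantization group;
    the actual overhead depends on the specific PTQ format.
    """
    overhead = (scale_bits + zero_point_bits) / group_size
    return weight_bits + overhead

def weight_memory_gb(num_params_billion: float, bpw: float) -> float:
    """Approximate weight-storage footprint in GB for a model of the given size."""
    return num_params_billion * 1e9 * bpw / 8 / 1e9

if __name__ == "__main__":
    # Hypothetical configurations fitting roughly the same (~8 GB) weight budget.
    configs = [
        ("14B @ 4-bit, group 32",  14.0, effective_bpw(4, 32)),
        ("7B  @ 8-bit, group 128",  7.0, effective_bpw(8, 128)),
    ]
    for name, params_b, bpw in configs:
        mem = weight_memory_gb(params_b, bpw)
        # Both configurations sit above the ~3.5 effective-BPW threshold, where the
        # paper finds the larger, lower-precision model tends to be the better choice.
        print(f"{name}: {bpw:.2f} effective BPW, ~{mem:.1f} GB of weights")

Run as a script, this prints the effective BPW and approximate weight memory for each hypothetical configuration, showing how two very different model/precision pairs can occupy a similar memory budget.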
Primary Area: datasets and benchmarks
Submission Number: 15144