Keywords: LLM, Mobile, Latency, Profiler
TL;DR: A simple but effective profiler for on-device LLM inference latency
Abstract: On-device inference of large language models (LLMs) is increasingly central to
mobile and edge AI, yet profiling their latency remains challenging: existing
methods are often server-centric, rely on operator-level instrumentation, or incur
overheads that make them impractical for constrained devices. We present
Simple Few-Shot Lining (SimFLi), a lightweight and training-free profiler that
decomposes inference into prefill (time-to-first-token) and decode phases, and
estimates latency from only a few token-length probes. Despite its simplicity,
SimFLi achieves accurate latency surfaces without requiring training, extensive
measurement, or intrusive instrumentation. Across diverse devices and compact
LLMs, SimFLi provides consistently strong R², RMSE, and MAE under a strict
few-point budget, reducing measurement cost by more than an order of magnitude
compared to baselines. In practice, SimFLi offers a practical, low-overhead
tool to guide model, quantization, and backend choices in real-world on-device
deployments.
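The abstract's core idea, estimating the latency surface from only a few token-length probes over separate prefill and decode phases, can be sketched as follows. This is a minimal illustration under an assumed linear latency model; the fitting form, the probe counts, and all variable names are assumptions for illustration, not SimFLi's actual method, and the probe timings shown are synthetic placeholders rather than measured data.

```python
# Sketch: few-shot latency estimation from token-length probes.
# Assumption (ours, not the paper's): latency in each phase is
# approximately linear in token count, so two coefficients per
# phase suffice and a handful of probes pins them down.
import numpy as np

def fit_latency_model(probe_lengths, probe_latencies):
    """Fit latency ~ a * tokens + b from a few (length, time) probes."""
    a, b = np.polyfit(probe_lengths, probe_latencies, deg=1)
    return lambda n: a * n + b

# Synthetic probe measurements; in practice these would come from
# a few on-device inference runs at chosen token lengths.
prefill_lengths = [32, 256, 1024]       # prompt lengths probed
prefill_times   = [0.05, 0.32, 1.25]    # time-to-first-token (s)
decode_lengths  = [16, 64, 256]         # generated-token counts probed
decode_times    = [0.40, 1.55, 6.20]    # decode-phase time (s)

predict_prefill = fit_latency_model(prefill_lengths, prefill_times)
predict_decode  = fit_latency_model(decode_lengths, decode_times)

def predict_total(prompt_tokens, gen_tokens):
    """End-to-end estimate = prefill (TTFT) + decode time."""
    return predict_prefill(prompt_tokens) + predict_decode(gen_tokens)
```

With only six probe runs, this yields a full latency surface over (prompt length, generation length), which is the kind of measurement-budget reduction the abstract claims; whether a linear fit matches SimFLi's actual estimator is our assumption.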
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 9003