Keywords: LLM, Mobile, Latency, Profiler
TL;DR: A simple but effective profiler for on-device LLM inference latency
Abstract: On-device inference of large language models (LLMs) is increasingly central to
mobile and edge AI, yet profiling their latency remains challenging: existing
methods are often server-centric, rely on operator-level instrumentation, or incur
overheads that make them impractical for constrained devices. We present
Simple Few-Shot Lining (SimFLi), a lightweight and training-free profiler that
decomposes inference into prefill (time-to-first-token) and decode phases, and
estimates latency from only a few token-length probes. Despite its simplicity,
SimFLi achieves accurate latency surfaces without requiring training, extensive
measurement, or intrusive instrumentation. Across diverse devices and compact
LLMs, SimFLi provides consistently strong R², RMSE, and MAE under a strict
few-point budget, reducing measurement cost by more than an order of magnitude
compared to baselines. In practice, SimFLi offers a practical, low-overhead
tool to guide model, quantization, and backend choices in real-world on-device
deployments.
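The abstract's core idea, estimating the latency surface from only a few token-length probes over separate prefill and decode phases, can be sketched as follows. This is a minimal illustration under an assumed linear latency model; the fitting form, the probe counts, and all variable names are assumptions for illustration, not SimFLi's actual method, and the probe timings shown are synthetic placeholders rather than measured data.

```python
# Sketch: few-shot latency estimation from token-length probes.
# Assumption (ours, not the paper's): latency in each phase is
# approximately linear in token count, so two coefficients per
# phase suffice and a handful of probes pins them down.
import numpy as np

def fit_latency_model(probe_lengths, probe_latencies):
    """Fit latency ~ a * tokens + b from a few (length, time) probes."""
    a, b = np.polyfit(probe_lengths, probe_latencies, deg=1)
    return lambda n: a * n + b

# Synthetic probe measurements; in practice these would come from
# a few on-device inference runs at chosen token lengths.
prefill_lengths = [32, 256, 1024]       # prompt lengths probed
prefill_times   = [0.05, 0.32, 1.25]    # time-to-first-token (s)
decode_lengths  = [16, 64, 256]         # generated-token counts probed
decode_times    = [0.40, 1.55, 6.20]    # decode-phase time (s)

predict_prefill = fit_latency_model(prefill_lengths, prefill_times)
predict_decode  = fit_latency_model(decode_lengths, decode_times)

def predict_total(prompt_tokens, gen_tokens):
    """End-to-end estimate = prefill (TTFT) + decode time."""
    return predict_prefill(prompt_tokens) + predict_decode(gen_tokens)
```

With only six probe runs, this yields a full latency surface over (prompt length, generation length), which is the kind of measurement-budget reduction the abstract claims; whether a linear fit matches SimFLi's actual estimator is our assumption.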
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 9003