Bench360: Benchmarking Local LLM Inference from 360°

Linus Stuhlmann; Mauricio Fadel Argerich; Jonathan Fürst

19 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | CC BY 4.0

Keywords: LLM, Inference, Benchmark

TL;DR: Bench360 is a new LLM inference benchmark that supports custom tasks, multiple inference frameworks, quantization levels and usage scenarios, while monitoring relevant task and system metrics.

Abstract: Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration that balances functional and non-functional requirements takes substantial manual effort. Several benchmarks target LLM inference, but they are designed for narrow evaluation goals and are not user-focused: they fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 (Benchmarking Local LLM Inference from 360°). Bench360 lets users define custom tasks together with datasets and relevant task-specific metrics, and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch, and server). Bench360 tracks a wide range of metrics, including (1) system metrics, such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time), and (2) task-specific metrics, such as ROUGE, F1 score, or accuracy. We demonstrate Bench360 on four common LLM tasks (General Knowledge & Reasoning, QA, Summarization, and Text-to-SQL) across three hardware platforms and four state-of-the-art inference engines. Our results reveal several trade-offs between task performance and system-level efficiency and highlight differences among inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.
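
To make the configuration space and the system metrics named in the abstract more concrete, the following is a minimal sketch, not Bench360's actual API. All names (`Config`, `config_grid`, `summarize_run`) and the placeholder model and engine labels are hypothetical. It assumes queries run sequentially, as in a single-stream scenario, and that total energy for the run has already been measured externally.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean

# NOTE: every name here is a hypothetical illustration, not Bench360's API.


@dataclass(frozen=True)
class Config:
    model: str
    engine: str
    quantization: str
    scenario: str  # "single_stream", "batch", or "server"


def config_grid(models, engines, quantizations, scenarios):
    """Enumerate every combination of the four benchmark dimensions."""
    return [
        Config(*combo)
        for combo in product(models, engines, quantizations, scenarios)
    ]


def summarize_run(latencies_s, tokens_generated, energy_joules):
    """Aggregate per-query measurements into a few system metrics.

    latencies_s:      per-query wall-clock latency in seconds
    tokens_generated: per-query output token counts
    energy_joules:    total energy drawn over the whole run
    """
    n = len(latencies_s)
    if n == 0:
        raise ValueError("at least one query is required")
    # Assumption: queries ran one after another, so total time is the sum.
    total_time_s = sum(latencies_s)
    return {
        "mean_latency_s": mean(latencies_s),
        "throughput_tok_per_s": sum(tokens_generated) / total_time_s,
        "energy_per_query_j": energy_joules / n,
    }


if __name__ == "__main__":
    grid = config_grid(
        models=["model-a", "model-b"],
        engines=["engine-x"],
        quantizations=["fp16", "int4"],
        scenarios=["single_stream", "batch", "server"],
    )
    print(len(grid), "configurations to benchmark")
    print(summarize_run([0.5, 1.5], [10, 30], 100.0))
```

Even this small grid (two models, one engine, two quantization levels, three scenarios) yields twelve runs, which illustrates why the abstract argues that finding a good configuration by hand takes substantial effort.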

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 17223
