LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models

Published: 01 Jan 2025, Last Modified: 03 Sept 2025 · COLING (Industry) 2025 · CC BY-SA 4.0
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks in recent years. This has inspired researchers and practitioners in industry to build useful products by leveraging LLMs. However, before deploying LLM-based solutions for real-world usage, it is crucial to evaluate LLMs extensively, in terms of accuracy, memory management, and inference latency, while ensuring that the results are reproducible. In addition, when evaluating LLMs on internal customer data, an on-premise evaluation system is necessary to protect customer privacy, rather than sending customer data to third-party APIs for evaluation. In this paper, we show how we built an on-premise LLM evaluation system that addresses these challenges in real-world industrial settings. We discuss the complexities of consolidating various datasets, models, and inference-related artifacts in complex LLM inference pipelines, and present a case study from a real-world industrial setting. This account of the evaluation tool's development should help researchers and practitioners build on-premise LLM evaluation systems that ensure privacy, reliability, robustness, and reproducibility.
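To make the kind of measurement such a system performs concrete, the following is a minimal sketch of an on-premise latency and memory probe for a locally hosted model. It is not the paper's implementation: it assumes a Hugging Face Transformers checkpoint already downloaded to local disk, and the model path, prompt, and generation settings are hypothetical placeholders.

# Minimal sketch of an on-premise latency/memory probe for a local model.
# Assumes a Hugging Face checkpoint on local disk; the path, prompt, and
# generation settings below are illustrative, not taken from the paper.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/models/local-llm"  # hypothetical on-premise checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map="auto"
)

def profile_generation(prompt: str, max_new_tokens: int = 64) -> dict:
    """Run one generation and report latency and peak GPU memory."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    latency = time.perf_counter() - start
    peak_mem_gb = (
        torch.cuda.max_memory_allocated() / 1024**3
        if torch.cuda.is_available()
        else float("nan")
    )
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    return {
        "latency_s": latency,
        "tokens_per_s": new_tokens / latency,
        "peak_gpu_mem_gb": peak_mem_gb,
        "text": tokenizer.decode(outputs[0], skip_special_tokens=True),
    }

print(profile_generation("Summarize the incident report in one sentence."))

Because model weights, prompts, and outputs never leave the local machine, a harness along these lines keeps customer data on-premise while still producing the latency and memory figures needed for deployment decisions; accuracy metrics and reproducibility controls (pinned seeds, dataset versions) would sit on top of such a probe.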