LogEval: A comprehensive benchmark suite for LLMs in log analysis

Published: 2025, Last Modified: 09 Jan 2026. Empir. Softw. Eng. 2025. License: CC BY-SA 4.0
Abstract: Log analysis is vital in Artificial Intelligence for IT Operations (AIOps) and plays a crucial role in ensuring software reliability and system stability. However, challenges persist, including the absence of comprehensive evaluation standards, inconsistencies in benchmarking practices, and limited exploration of Large Language Models (LLMs) on log-related tasks. To address these issues, we introduce LogEval, a comprehensive benchmark designed to systematically evaluate LLM performance across four key log analysis tasks: log parsing, log anomaly detection, log fault diagnosis, and log summarization. LogEval tackles these challenges through the following aspects: (i) it incorporates 4,000 publicly available log entries spanning the four tasks, providing a strong foundation for evaluating LLM performance; (ii) it uses standardized prompts in both English and Chinese to ensure consistent and objective evaluation, covering two experimental paradigms, naive question-answering (Q&A) and self-consistency (SC) Q&A, under both zero-shot and few-shot settings, while also measuring inference efficiency and average token usage; (iii) it offers an open-source, continuously updated platform (https://nkcs.iops.ai/LogEval/) that integrates new LLMs and user-uploaded production data, fostering reproducibility and adaptability in performance comparisons. The experimental results provide valuable insights into the varying strengths of LLMs across different tasks, highlighting opportunities for further optimization and innovation in LLM-based log analysis. Our code repository is available at https://github.com/LinDuoming/LogEval.
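For readers unfamiliar with the self-consistency (SC) Q&A paradigm mentioned in the abstract, the sketch below illustrates the general idea: the same prompt is sampled from the model several times and the final answer is chosen by majority vote, after which predictions are scored against ground-truth labels. The `ask_model` callable, the function names, and the exact-match scoring are illustrative assumptions for this sketch and do not reflect LogEval's actual implementation.

```python
# Minimal sketch of self-consistency (SC) Q&A evaluation.
# Assumptions: `ask_model` is any hypothetical callable that returns one sampled
# model completion for a prompt; answers are compared by normalized exact match.
from collections import Counter
from typing import Callable, List


def self_consistency_answer(
    ask_model: Callable[[str], str],  # hypothetical model interface, not LogEval's API
    prompt: str,
    n_samples: int = 5,
) -> str:
    """Query the model n_samples times and return the most frequent answer."""
    answers: List[str] = [ask_model(prompt).strip().lower() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def exact_match_accuracy(predictions: List[str], labels: List[str]) -> float:
    """Fraction of predictions that exactly match the normalized labels."""
    correct = sum(p == l.strip().lower() for p, l in zip(predictions, labels))
    return correct / len(labels) if labels else 0.0
```

The naive Q&A paradigm corresponds to the special case of a single sample per prompt; the SC variant trades additional inference cost (and hence higher token usage) for more stable answers.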