Abstract: Logs of large-scale cloud systems record diverse system events, ranging from routine statuses to critical errors. As the fundamental step of automated log analysis, log parsing transforms unstructured logs into structured data for easier management and analysis. However, existing syntax-based and deep learning-based parsers struggle with complex real-world logs. Recent parsers based on large language models (LLMs) achieve higher accuracy, but they typically rely on online APIs (e.g., ChatGPT), which raises privacy concerns and incurs network latency. Moreover, with the rise of artificial intelligence for IT operations (AIOps), traditional parsers that focus on syntax-level templates fail to capture the semantics of dynamic log parameters, limiting their usefulness for downstream tasks. These challenges highlight the need for semantic log parsing that goes beyond template extraction to understand parameter semantics. This paper presents SemanticLog, an effective and efficient semantic log parser powered by open-source LLMs. SemanticLog adapts the structure of LLMs to the log parsing task, leveraging their rich knowledge while safeguarding log data privacy. It first extracts informative feature representations from log data, then refines them through fine-grained semantic perception to enable accurate template and parameter extraction together with semantic category prediction. To improve scalability, SemanticLog introduces the EffiParsing tree for faster inference on large-scale logs. Extensive experiments on the LogHub-2.0 dataset show that SemanticLog significantly outperforms state-of-the-art log parsers in accuracy. It also surpasses existing LLM-based parsers in efficiency while demonstrating advanced semantic parsing capability. Notably, SemanticLog employs much smaller open-source LLMs than existing LLM-based parsers (mainly based on ChatGPT) while offering stronger protection of log data privacy.
External IDs: doi:10.1109/tse.2025.3625121