Keywords: Benchmarking, Representation Learning, Large Language Models, Industry 4.0
TL;DR: Augmenting Industrial Maintenance with LLMs: A Benchmark, Analysis, and Generalization Study
Abstract: Monitoring the life cycle of complex industrial systems often relies on expertly curated temporal conditions derived from sensor data, a process that requires significant time investment and deep domain expertise. We explore the potential of Large Language Models (LLMs) to generate context-aware and accurate maintenance recommendations based on their ability to reason about and generalize over temporal sensor conditions. To this end, we formulate a novel pipeline that systematically converts human-authored symbolic conditions into a multiple-choice question answering (MCQA) dataset. We apply our pipeline to create DiagnosticIQ, a dataset of more than 6,000 MCQA items covering 16 different types of physical assets that represent real-world maintenance use cases. We assess 15 state-of-the-art LLMs on this dataset and create a leaderboard for the maintenance action recommendation task. Furthermore, we demonstrate the practical utility of DiagnosticIQ in two key respects: first, as a knowledge base that enhances maintenance action recommendations, and second, as a resource for fine-tuning a specialized LLM that generalizes to previously unseen assets, facilitating the rule creation process.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22110