A Learnable Cross-Modal Adapter for Industrial Fault Detection Using Pretrained Vision Models

Jonne van Dreven, Abbas Cheddad, Sadi Alawadi, Ahmad Nauman Ghazi, Jad Al Koussa, Dirk Vanhoudt

Published: 11 Feb 2026, Last Modified: 07 May 2026 | IEEE Transactions on Industrial Informatics | Everyone | CC BY 4.0

Abstract: Automatic Fault Detection and Diagnosis (FDD) is critical for maintaining reliable and efficient industrial systems. However, conventional methods rely heavily on manual inspections or threshold-based techniques, which often fail to capture the dynamic patterns in Time Series (TS) sensor data. As a result, faults persist for extended periods, leading to suboptimal system operation, increased energy waste, and significant economic losses. This work proposes a cross-modal framework that facilitates the efficient deployment of state-of-the-art pre-trained vision models for enhanced FDD, with two novel TS-to-image transformations: (i) an adapter deep encoder that learns optimal, task-specific representations from raw sensor data while generating outputs that are input-compliant with pre-trained models, and (ii) an enhanced line plot that creates geometric shapes from two related signals. Comparative experiments against fixed transformation methods, including spectrograms, Gramian Angular Fields, Markov Transition Fields, and Recurrence Plots, as well as five deep learning baseline models, showed substantial performance gains across diverse domains. InceptionTime achieved the highest average baseline performance with an F$_1$ score of $88.6\%$, while the adapter and shapes transformations achieved $94.4\%$ and $92.4\%$, respectively. The findings highlight the potential of the cross-modal framework for FDD to facilitate early intervention and efficient system maintenance in industrial settings.
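
To illustrate what a fixed TS-to-image transformation looks like, the following is a minimal sketch of a Gramian Angular Summation Field, one of the baseline families named in the abstract. It is not the paper's learnable adapter or its shapes transformation. The function names, the 64-sample sine signal, and the three-channel, channels-first output layout are assumptions chosen for illustration only.

```python
import numpy as np


def gramian_angular_summation_field(series):
    """Map a 1-D series of length n to an n x n image (fixed, non-learned)."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    scaled = np.clip(scaled, -1.0, 1.0)
    # Polar encoding: each sample becomes an angle phi_i.
    phi = np.arccos(scaled)
    # G[i, j] = cos(phi_i + phi_j)
    return np.cos(phi[:, None] + phi[None, :])


def to_three_channel_image(field):
    """Hypothetical helper: shift to [0, 1] and repeat over 3 channels.

    Assumes the downstream vision model expects a (3, H, W) input.
    """
    img = (field + 1.0) / 2.0
    return np.repeat(img[None, :, :], 3, axis=0)


# Illustrative input: two periods of a sine wave, 64 samples.
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
gasf = gramian_angular_summation_field(signal)
image = to_three_channel_image(gasf)
```

A fixed transformation like this contains no trainable parameters, which is the contrast the abstract draws with the adapter encoder. Resizing to a particular model's input resolution and any channel normalization are omitted here.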