MAD: A Multimodal Anomaly Detection Framework Based on Shared Transformer and Contrastive Learning for Smart Manufacturing

ICLR 2026 Conference Submission 17774 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal, Contrastive Learning, Normalizing Flow, Representation Learning, Anomaly Detection
Abstract: As smart manufacturing environments advance, anomaly detection techniques that integrate heterogeneous sensor data are becoming increasingly important. However, effectively fusing data with very different characteristics, such as phase-resolved partial discharge (PRPD) images and partial discharge (PD) time series, remains technically difficult. To address this, this study proposes MAD, a high-performance multimodal framework based on a two-step training strategy. First, to reduce representation differences between modalities, a RealNVP-based normalizing flow is introduced to align each modality's representation into a shared latent space. Second, supervised contrastive learning is applied to learn a structured representation space with well-defined boundaries between classes. The aligned and structured representations are then fed into a LIMoE encoder, a Mixture-of-Experts-based shared Transformer, which classifies the anomaly type. Experimental results demonstrate that the proposed MAD model outperforms existing SOTA multimodal models. In particular, MAD achieves an AUC of 100.0% and an F1-score of 99.98%, results on par with or better than those of Perceiver IO, Cross-Modal Transformer, and CFM.
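For concreteness, the following is a minimal, illustrative PyTorch sketch of the two training steps the abstract describes: a RealNVP-style affine coupling layer whose negative log-likelihood under a shared Gaussian prior aligns per-modality embeddings in one latent space, and a supervised contrastive loss over the aligned embeddings. All names (AffineCoupling, flow_nll, supcon_loss), dimensions, and hyperparameters are hypothetical and not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineCoupling(nn.Module):
    # One RealNVP affine coupling layer: the second half of the features is
    # scaled and shifted conditioned on the first half, keeping the map invertible.
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                          # bound the log-scale for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                    # log|det Jacobian| of the transform
        return torch.cat([x1, y2], dim=-1), log_det

def flow_nll(flow, x):
    # Negative log-likelihood (up to an additive constant) under a shared
    # standard-normal prior; minimizing this per modality pulls all modalities
    # toward the same latent distribution.
    z, log_det = flow(x)
    return 0.5 * (z ** 2).sum(dim=-1) - log_det

def supcon_loss(z, labels, temperature=0.1):
    # Supervised contrastive loss (Khosla et al., 2020): same-class embeddings
    # are pulled together, different-class embeddings pushed apart.
    z = F.normalize(z, dim=-1)
    sim = z @ z.T / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))   # exclude self-pairs
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    mean_log_prob_pos = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    has_pos = pos.any(dim=1)                          # skip anchors with no positive
    return -mean_log_prob_pos[has_pos].mean()

# Toy usage: one flow per modality maps 64-d image and time-series embeddings
# into the shared latent space; SupCon then structures that space by class.
flow_img, flow_ts = AffineCoupling(dim=64), AffineCoupling(dim=64)
img_emb, ts_emb = torch.randn(8, 64), torch.randn(8, 64)
labels = torch.randint(0, 3, (16,))
z_img, _ = flow_img(img_emb)
z_ts, _ = flow_ts(ts_emb)
align_loss = flow_nll(flow_img, img_emb).mean() + flow_nll(flow_ts, ts_emb).mean()
loss = align_loss + supcon_loss(torch.cat([z_img, z_ts], dim=0), labels)

In this sketch the flow alignment and the contrastive objective are combined in one loss for brevity; the paper instead describes them as separate steps of a two-step training strategy.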
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 17774