FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

17 Sept 2025 (modified: 03 Dec 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Self-Supervised Learning, Foundation Model, Anomaly Detection, Fault Diagnosis
TL;DR: A foundation model for industrial signal representation and a corresponding benchmark.
Abstract: Industrial signal analysis has emerged as a critical problem for the industry. Due to severe heterogeneity within industrial signals, which we summarize as the M5 problem, previous works could only deal with small sub-problems by training specialized models, an approach that lacks robustness and incurs heavy burdens during development and deployment. However, we argue that the M5 problem can be addressed by scaling up, where handling multiple sampling rates is the first step. In this paper, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER treats an increase in sampling rate as the concatenation of additional sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher-student SSL framework for pre-training. To evaluate model performance, we also develop the RMIS benchmark, which consists of 19 datasets across four modalities. FISHER is compared with 15 SOTA speech/audio/music encoders and demonstrates versatile and outstanding capabilities, with a general performance gain of at least 3.23%. Meanwhile, FISHER exhibits much more efficient scaling curves: even FISHER-tiny, with 5.5M parameters, outperforms huge baseline encoders with up to 2B parameters. We further reveal that the key to this success is adaptively utilizing the full signal bandwidth regardless of the sampling rate. Both FISHER and RMIS will be open-sourced.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 8526
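
The abstract's idea that a higher sampling rate simply contributes more sub-bands can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stft_subbands`, the 25 ms window, the 10 ms hop, and the 1 kHz band width are all hypothetical choices made for illustration. The sketch assumes the window and hop are fixed in seconds, so the frequency resolution of each STFT bin is the same at every sampling rate and only the number of sub-bands changes.

```python
import numpy as np

def stft_subbands(signal, sample_rate, win_sec=0.025, hop_sec=0.010, band_hz=1000.0):
    """Split a log-magnitude STFT into fixed-width frequency sub-bands.

    Hypothetical helper: window and hop are fixed in seconds (assumed values),
    so the bin spacing is 1 / win_sec Hz at any sampling rate.
    Returns an array of shape (n_bands, n_frames, bins_per_band).
    """
    win = int(round(win_sec * sample_rate))   # window length in samples
    hop = int(round(hop_sec * sample_rate))   # hop length in samples
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)

    # Frame the signal and apply the analysis window.
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, win // 2 + 1)
    log_spec = np.log(spec + 1e-8)

    # Each sub-band spans band_hz, i.e. band_hz * win_sec frequency bins.
    bins_per_band = int(round(band_hz * win_sec))
    n_bands = (spec.shape[1] - 1) // bins_per_band   # the Nyquist bin is dropped

    bands = log_spec[:, : n_bands * bins_per_band]
    bands = bands.reshape(n_frames, n_bands, bins_per_band)
    return bands.transpose(1, 0, 2)

# The same 3.5 kHz tone, sampled at two different rates for one second.
for sr in (16000, 48000):
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 3500.0 * t)
    print(sr, stft_subbands(tone, sr).shape)
```

Under these assumptions, the 16 kHz signal yields 8 sub-bands and the 48 kHz signal yields 24, while the shape of each sub-band (frames by bins) is identical. A model that consumes one sub-band at a time could therefore process both signals without resampling, which is the property the abstract attributes to using the full bandwidth at any sampling rate.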