mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Published: 02 Mar 2026, Last Modified: 02 Mar 2026. Accepted by TMLR. Readers: Everyone. License: CC BY 4.0
Abstract: Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, the presence of cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remains far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at https://plan-lab.github.io/mtsbench to encourage future research.
Certifications: Reproducibility Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear AC and Reviewers, We sincerely thank the AC and the reviewers for their constructive and thoughtful feedback, which has strengthened this paper. We have carefully addressed the latest comments from `Reviewer z1GP` by clarifying the scope of our conclusions to focus on the evaluated unsupervised selectors, strengthening the failure-mode discussion, and adding concise practical guidance for interpreting model selection results. In addition, following the AC's suggestion, we added a new discussion on dataset quality, including comparative case studies of flawed and well-curated time series with visualizations, performance comparisons, and in-depth analysis of how labeling and data issues affect evaluation outcomes. We greatly appreciate these suggestions, which helped improve the clarity, rigor, and practical relevance of our work. Thank you again for your time and insightful feedback.
Assigned Action Editor: ~Min_Wu2
Submission Number: 6187