Formally Exploring Time-Series Anomaly Detection Evaluation Metrics

Dennis Wagner; Arjun Nair; Billy Joe Franks; Justus Arweiler; Aparna Muraleedharan; Indra Jungjohann; Fabian Hartung; Andriy Balinskyy; Saurabh Varshneya; Mayank Chetan Ahuja; Nabeel Hussain Syed; Mayank Nagda; Philipp Liznerski; Steffen Reithermann; Maja Rudolph; Sebastian Josef Vollmer; Ralf Schulz; Torsten Katz; Stephan Mandt; Michael Bortz; Heike Leitte; Daniel Neider; Jakob Burger; Fabian Jirasek; Hans Hasse; Sophie Fellenz; Marius Kloft

Published: 03 Feb 2026; Last Modified: 23 Apr 2026; Venue: AISTATS 2026 Poster; License: CC BY 4.0

Abstract: Detecting anomalies in time series is vital to ensure safety and reliability in many real-world applications. Despite the staggering number of anomaly detection methods, it remains unclear which methods perform best, largely due to flawed evaluation practices. Without rigorous analysis, evaluations yield unintuitive or misleading comparisons. Existing evaluation metrics often focus on specifics and therefore fail to capture essential aspects of the anomaly detection task. In this work, we formalize the problem by introducing verifiable properties of evaluation metrics, each of which reflects an important aspect of anomaly detection in time series. By formalizing requirements and analyzing them systematically, we outline a theoretical framework for evaluating time-series anomaly detection that can support principled evaluations and reliable comparisons. We analyze 37 known metrics and prove that most satisfy only a few of the properties and none satisfies all of them, which explains many observed inconsistencies in evaluations. To address this gap, we introduce a new, flexible evaluation metric, LARM, that provably satisfies all properties. We illustrate the adaptability of this approach by refining the properties to impose stricter requirements and adapting LARM to these advanced properties, yielding ALARM.

Submission Number: 1489
