EVEREST: A Transformer for Probabilistic Rare-Event Anomaly Detection with Evidential and Tail-Aware Uncertainty
Keywords: Transformer models, Uncertainty quantification, Evidential deep learning, Extreme value theory, Imbalanced classification
TL;DR: EVEREST is a transformer architecture for rare-event time-series forecasting that combines evidential and tail-aware uncertainty estimation to deliver calibrated, interpretable, state-of-the-art predictions across scientific anomaly-detection tasks.
Abstract: Forecasting rare events in multivariate time-series data is a central challenge in machine learning, complicated by severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability through attention-based signal attribution. EVEREST integrates four key components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head that estimates aleatoric and epistemic uncertainty via a Normal–Inverse–Gamma distribution; (iii) an extreme-value head that models tail risk with a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimised with a composite loss combining focal loss, the evidential negative log-likelihood, and a tail-sensitive EVT penalty; they act only at training time, so deployment uses a single classification head with no added inference overhead. We evaluate EVEREST on a real-world benchmark spanning a decade of space-weather data and demonstrate state-of-the-art performance, including True Skill Statistic (TSS) scores of 0.973, 0.970, and 0.966 at 24-, 48-, and 72-hour horizons for C-class flares. The model is compact (≈0.81M parameters), trains efficiently on commodity hardware, and is applicable to other high-stakes domains such as industrial monitoring, weather forecasting, and satellite diagnostics. Limitations include reliance on fixed-length inputs and the exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
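The evidential and extreme-value heads described in the abstract follow standard formulations from the literature (the Normal–Inverse–Gamma distribution of deep evidential regression, Amini et al. 2020, and the Generalized Pareto Distribution from extreme value theory). As a rough illustration of how such heads and the composite training loss might look, here is a minimal PyTorch sketch; it is not the authors' implementation, and the module names, the softplus parameterisation, and the loss weights `lam_nig`/`lam_evt` are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Predicts Normal-Inverse-Gamma parameters (gamma, nu, alpha, beta).
    Softplus keeps nu, beta > 0 and alpha > 1 (hypothetical parameterisation)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 4)

    def forward(self, h):
        gamma, nu, alpha, beta = self.proj(h).chunk(4, dim=-1)
        return gamma, F.softplus(nu), F.softplus(alpha) + 1.0, F.softplus(beta)

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of the NIG evidential distribution
    (Amini et al., 2020). Aleatoric variance is beta / (alpha - 1);
    epistemic variance is beta / (nu * (alpha - 1))."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * torch.log(torch.pi / nu)
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log(nu * (y - gamma) ** 2 + omega)
            + torch.lgamma(alpha)
            - torch.lgamma(alpha + 0.5)).mean()

def gpd_nll(z, xi, sigma):
    """Generalized Pareto NLL over threshold exceedances z > 0,
    assuming shape xi > 0 and scale sigma > 0."""
    return (torch.log(sigma)
            + (1.0 + 1.0 / xi) * torch.log1p(xi * z / sigma)).mean()

def focal_loss(logits, targets, gamma_f=2.0):
    """Binary focal loss (Lin et al., 2017) for the imbalanced event class."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets > 0.5, p, 1.0 - p)
    return (-(1.0 - pt) ** gamma_f * torch.log(pt.clamp_min(1e-8))).mean()

# Composite training objective; lam_nig / lam_evt are hypothetical weights:
# loss = focal_loss(logits, y_cls) + lam_nig * nig_nll(...) + lam_evt * gpd_nll(...)
```

Consistent with the abstract, the evidential and EVT terms would shape training only; at deployment, just the classification head producing `logits` is evaluated, adding no inference overhead.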
Supplementary Material: zip
Primary Area: learning on time series and dynamical systems
Submission Number: 25448