Benchmarking Reliability and Generalization Beyond Classification

ICLR 2026 Conference Submission 18439 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: benchmark, semantic segmentation, object detection, common corruptions, OOD, adversarial attacks
TL;DR: Robustness benchmarking tools and benchmarks for semantic segmentation and object detection.
Abstract: Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, each with its own diverse set of dedicated model architectures. To facilitate research on robust model design for segmentation and detection, our primary objective is to provide benchmarking tools for evaluating robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date of the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends related to architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in an Anonymous Repository (URL: https://anonymous.4open.science/r/benchmarking_reliability_generalization/) together with our complete set of 6139 evaluations. We anticipate that the collected data will foster future research towards improved model reliability beyond classification.
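
To make the evaluation setup concrete, the sketch below illustrates the kind of robustness measurement the abstract describes: comparing a segmentation model's mIoU on clean inputs against its mIoU under a common corruption and under a one-step adversarial attack. This is a minimal PyTorch illustration, not the SEMSEGBENCH or DETECBENCH API; the function names (`gaussian_noise`, `fgsm`, `mean_iou`), the noise severity, the attack budget, and the `ignore_index=255` label convention are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def gaussian_noise(model, images, targets, sigma: float = 0.08):
    """Additive Gaussian noise as a stand-in common corruption.
    The severity value is illustrative, not taken from the paper."""
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)


def fgsm(model, images, targets, eps: float = 2 / 255):
    """One-step FGSM on the per-pixel cross-entropy loss (a standard
    attack; the paper's exact attack suite may differ)."""
    with torch.enable_grad():
        adv = images.clone().requires_grad_(True)
        loss = F.cross_entropy(model(adv), targets, ignore_index=255)
        grad, = torch.autograd.grad(loss, adv)
    return (images + eps * grad.sign()).clamp(0.0, 1.0).detach()


def mean_iou(model, loader, perturb=None, num_classes: int = 19):
    """mIoU over a dataset, optionally under a perturbation function.
    Absent classes contribute an IoU of 0 here; real benchmarks
    typically mask them out instead."""
    inter = torch.zeros(num_classes)
    union = torch.zeros(num_classes)
    model.eval()
    for images, targets in loader:           # targets: (B, H, W) class ids
        if perturb is not None:
            images = perturb(model, images, targets)
        with torch.no_grad():
            preds = model(images).argmax(1)  # logits (B, C, H, W) -> labels
        for c in range(num_classes):
            p, t = preds == c, targets == c
            inter[c] += (p & t).sum()
            union[c] += (p | t).sum()
    return (inter / union.clamp(min=1)).mean().item()


# Usage: compare clean performance against corruption and attack robustness.
# clean = mean_iou(model, val_loader)
# corr  = mean_iou(model, val_loader, perturb=gaussian_noise)
# adv   = mean_iou(model, val_loader, perturb=fgsm)
```

The same pattern extends to object detection by swapping the metric (e.g., mAP) and the loss used by the attack; the benchmarks reported in the paper aggregate such measurements over many models, corruptions, and attack strengths.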
Primary Area: datasets and benchmarks
Submission Number: 18439