Keywords: benchmarks, survival analysis, time to event, fairness, health equity, responsible ai, heterogeneous treatment effects
TL;DR: We introduce a comprehensive benchmarking framework for evaluating survival models in realistic settings and present actionable insights from a systematic evaluation of common survival models across a wide range of datasets.
Abstract: Survival models are widely used to model time-to-event or survival data, which represents the duration until an event of interest occurs. In clinical research, survival analysis is used to estimate the effects of treatments on patient health outcomes. Recent advancements in machine learning (ML) have aimed to improve survival analysis methods, but current evaluation practices largely focus on predictive performance, often neglecting critical factors such as the ability to accurately estimate treatment effects and possible consequences for health equity. Estimating treatment effects from time-to-event data presents unique challenges due to the complex problem setting, the extensive assumptions required for causal inference, biased observational data, and the ethical consequences of using model outcomes in real-world health decisions. In this work, we introduce a comprehensive benchmarking framework designed to evaluate survival models on their ability to estimate treatment effects under realistic conditions and in the presence of potential inequalities. We formalize the discussion of bias in survival modeling, identify key sources of inequity, and outline practical desiderata for methods that model time-to-event treatment effects. We clarify common assumptions in survival analysis, discuss critical shortcomings in current evaluation practices, and propose a new benchmarking metric for better evaluating model calibration. Using this framework, we systematically compare traditional and modern survival models across multiple synthetic and real-world datasets, investigating, among other challenges, model performance under misspecification and observational biases. Through this benchmark, we provide actionable insights to help researchers develop more robust and equitable survival models.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3975