SAFE: Benchmarking AI Weather Prediction Fairness with Stratified Assessments of Forecasts over Earth
Keywords: fairness, weather, climate, artificial intelligence, machine learning
TL;DR: AI weather prediction models exhibit biases in forecast performance based on geographic region, income, landcover, and lead time.
Abstract: The dominant paradigm in machine learning is to assess model performance by the average loss over all samples in some test set. However, this approach fails to account for the non-uniform patterns of human development and geography across Earth. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates data from several domains to stratify gridpoints by four attributes: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine model performance within each individual stratum of an attribute (e.g., the accuracy in every individual country). To demonstrate its importance, we use SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that all of them exhibit disparities in forecasting skill across every attribute. We use these results to seed a benchmark of forecast fairness, stratified at different lead times for various climatic variables. By moving beyond globally averaged metrics, we can ask for the first time: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is made available at https://anonymous.4open.science/r/safe-E7C7.
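The core idea of stratified assessment described in the abstract, grouping per-gridpoint errors by an attribute and scoring each stratum separately rather than averaging globally, can be sketched as follows. This is a minimal illustration, not the SAFE package's actual API; the function name and data layout are assumptions.

```python
# Minimal sketch of stratified forecast evaluation (illustrative only,
# not the SAFE API). Each gridpoint has a forecast error and a stratum
# label drawn from some attribute (country, subregion, income, landcover).
from collections import defaultdict

def stratified_rmse(errors, strata):
    """Group per-gridpoint errors by stratum label and return
    the RMSE within each stratum (e.g., each country)."""
    buckets = defaultdict(list)
    for err, label in zip(errors, strata):
        buckets[label].append(err ** 2)
    return {label: (sum(sq) / len(sq)) ** 0.5 for label, sq in buckets.items()}

# Toy example: four gridpoints stratified by landcover (land vs. water).
errors = [1.0, 3.0, 2.0, 2.0]
strata = ["land", "land", "water", "water"]
per_stratum = stratified_rmse(errors, strata)
# land RMSE = sqrt((1 + 9) / 2) = sqrt(5); water RMSE = sqrt((4 + 4) / 2) = 2.0
```

Comparing the per-stratum scores (rather than a single global mean) is what exposes the disparities in forecasting skill that the benchmark measures.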
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24827