CarbonGlobe: A Global-Scale, Multi-Decade Dataset and Benchmark for Carbon Forecasting in Forest Ecosystems

Zhihao Wang; Lei Ma; George Hurtt; Xiaowei Jia; Yanhua Li; Ruohan Li; Zhili Li; Shuo Xu; Yiqun Xie

Published: 18 Sept 2025, Last Modified: 30 Oct 2025. Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster). License: CC BY 4.0

Keywords: Long-term time-series forecasting, carbon forecasting, ecosystem modeling

TL;DR: CarbonGlobe is the first publicly available ML-ready dataset for global long-term carbon forecasting, featuring 0.5° resolution, 40-year span, multi-source inputs, calibrated ED outputs, and standardized evaluation scenarios.

Abstract: Forest ecosystems play a critical role in the Earth system as major carbon sinks that are essential for carbon neutrality and climate change mitigation. However, the Earth has undergone significant deforestation and forest degradation, and the remaining forested areas face increasing pressures from socioeconomic factors and climate change, potentially pushing them towards tipping points. In response to this grand challenge, a theory-based Ecosystem Demography (ED) model has been continuously developed over the past two decades and serves as a key component in major initiatives, including the Global Carbon Budget, NASA Carbon Monitoring System, and US Greenhouse Gas Center. Despite its growing importance in combating climate change and shaping carbon policies, ED's high computational cost significantly limits its ability to estimate carbon dynamics at the global scale with high spatial resolution. Recently, machine learning (ML) models have shown promising potential in approximating theory-based models, with notable success in various domains including weather forecasting, thanks in part to openly available benchmark datasets. However, there are currently no publicly available ML-ready datasets for global carbon dynamics forecasting in forest ecosystems. The limited data availability hinders the development of corresponding ML emulators. Furthermore, the inputs needed for running ED are highly complex, with over a hundred variables from various remote sensing products.
To bridge the gap, we develop a new ML-ready benchmark dataset, CarbonGlobe, for carbon dynamics forecasting, with the following features: (1) the data has global-scale coverage at 0.5° resolution; (2) the temporal range spans 40 years; (3) the inputs integrate extensive multi-source data from different sensing products, with calibrated outputs from ED; (4) the data is formatted in ML-ready forms and split into different evaluation scenarios based on climate conditions and other factors; (5) a set of problem-driven metrics is designed to develop benchmarks using various ML models to best align with the needs of downstream applications. Our dataset and code are publicly available on Kaggle and GitHub: https://www.kaggle.com/datasets/zhihaow/carbonglobe and https://github.com/zhwang0/carbon-globe.

Croissant File: json

Dataset URL: https://www.kaggle.com/datasets/zhihaow/carbonglobe

Code URL: https://github.com/zhwang0/carbon-globe

Primary Area: AI/ML Datasets & Benchmarks for physics (e.g. climate, health, life sciences, physics, social sciences)

Submission Number: 1982
