SmokeViz: A Large-Scale Satellite Dataset for Wildfire Smoke Detection and Segmentation

Rey Koki; Michael McCabe; Dhruv Kedar; Josh Myers-Dean; Annabel Wade; Jebb Q. Stewart; Christina Kumler-Bonfanti; Jed Brown

Published: 18 Sept 2025, Last Modified: 30 Oct 2025

Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster)

License: CC BY 4.0

Keywords: wildfire smoke, deep learning, dataset, remote sensing, satellite, semi-supervised learning

TL;DR: We use physics-guided semi-supervised learning to align human-labeled wildfire smoke annotations with GOES satellite imagery, creating SmokeViz, a large-scale dataset for smoke plume segmentation.

Abstract: The global rise in wildfire frequency and intensity over the past decade underscores the need for improved fire monitoring techniques. To advance deep learning research on wildfire detection and its associated human health impacts, we introduce **SmokeViz**, a large-scale machine learning dataset of smoke plumes in satellite imagery. The dataset is derived from expert annotations created by smoke analysts at the National Oceanic and Atmospheric Administration, which provide coarse temporal and spatial approximations of smoke presence. To enhance annotation precision, we propose **pseudo-label dimension reduction (PLDR)**, a generalizable method that applies pseudo-labeling to refine datasets with mismatched temporal and/or spatial resolutions. Unlike typical pseudo-labeling applications, which aim to increase the number of labeled samples, PLDR keeps the original labels but improves dataset quality by solving for intermediary pseudo-labels (IPLs) that align each annotation with the most representative input data. For SmokeViz, a parent model produces IPLs to identify the single satellite image within each annotation's time window that best corresponds to the annotated smoke plume. This refinement process produces a succinct and relevant deep learning dataset consisting of over 160,000 manual annotations. The SmokeViz dataset is expected to be a valuable resource for developing further wildfire-related machine learning models and is publicly available at https://noaa-gsl-experimental-pds.s3.amazonaws.com/index.html#SmokeViz/.
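The frame-selection step described in the abstract can be illustrated with a minimal sketch. It assumes the parent model has already produced one binary smoke mask (an IPL) for each satellite image in an annotation's time window, and it scores each mask against the analyst annotation with intersection-over-union (IoU). The abstract does not state which agreement measure is used, so IoU, the `Mask` type, and the function names `iou` and `select_best_frame` are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Assumption: a mask is a 2D grid of 0/1 values, where 1 marks a smoke pixel.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, label: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, label_row in zip(pred, label):
        for p, q in zip(pred_row, label_row):
            if p and q:
                inter += 1
            if p or q:
                union += 1
    # Two empty masks have no overlap to measure; score them as 0.
    return inter / union if union else 0.0


def select_best_frame(candidate_masks: List[Mask], annotation: Mask) -> Tuple[int, float]:
    """Return the index and score of the candidate that best matches the annotation.

    candidate_masks holds one hypothetical parent-model prediction per
    satellite image in the annotation's time window.
    """
    if not candidate_masks:
        raise ValueError("at least one candidate mask is required")
    scores = [iou(mask, annotation) for mask in candidate_masks]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

In this sketch, the returned index would identify the image to pair with the original annotation; the annotation itself is left unchanged, which matches the abstract's point that PLDR keeps the original labels rather than adding new ones.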

Croissant File: json

Dataset URL: https://noaa-gsl-experimental-pds.s3.amazonaws.com/index.html#SmokeViz/

Code URL: https://github.com/reykoki/SmokeViz

Primary Area: Datasets & Benchmarks for applications in computer vision

Submission Number: 2027
