To create Buildings-900K, we extract 900K total energy consumption time
series (in kilowatt-hours, kWh), one from each of the non-upgraded buildings in the 2021 version of
the EULP. To make the dataset more accessible, we also aggregate the 15-minute resolution data to
hourly resolution, which reduces its size. The resulting data requires about 110 GB of storage, significantly less than the entire
EULP (70+ TB). For each PUMA, we store all buildings in one Parquet file per year (there
are two years, 2018 and an aggregated “typical meteorological year” (TMY) [45]) and per building
type (residential/commercial), which amounts to 9,600 Parquet files. Each file has a column for the
timestamp and a column for each building’s energy consumption (8,760 rows per file). Processing
the EULP to extract this data took ∼3 days with PySpark (the Python API for Apache Spark) on a 96-core AWS cloud instance.
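
As an illustration of the hourly aggregation described above, the following minimal sketch sums each group of four 15-minute readings into one hourly value and confirms that a non-leap year such as 2018 yields 8,760 rows. It is not the PySpark pipeline used to build the dataset: the function name `aggregate_to_hourly`, the synthetic constant load, and the choice to label each hour by its starting timestamp are assumptions made for this example. Summation is assumed because energy in kWh is additive over time.

```python
from datetime import datetime, timedelta


def aggregate_to_hourly(start, readings_kwh):
    """Sum consecutive groups of four 15-minute kWh readings into hourly totals.

    Hypothetical helper for illustration only. Each hourly value is labeled
    with the timestamp at the start of the hour (an assumed convention).
    """
    if len(readings_kwh) % 4 != 0:
        raise ValueError("expected a whole number of hours of 15-minute readings")
    hourly = []
    for i in range(0, len(readings_kwh), 4):
        timestamp = start + timedelta(hours=i // 4)
        hourly.append((timestamp, sum(readings_kwh[i:i + 4])))
    return hourly


# Synthetic data (not from the EULP): one non-leap year of a constant
# 0.25 kWh consumed in every 15-minute interval.
start = datetime(2018, 1, 1)
quarter_hourly = [0.25] * (365 * 24 * 4)

hourly = aggregate_to_hourly(start, quarter_hourly)
print(len(hourly))  # number of hourly rows for one building-year
```

Each building's hourly series would then form one column of a per-PUMA table alongside the shared timestamp column.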