## Chesapeake Land Cover
### Overview
This dataset contains high-resolution aerial imagery from the USDA NAIP program [1], high-resolution land cover labels from the Chesapeake Conservancy [2], low-resolution land cover labels from the USGS NLCD 2011 dataset [3], low-resolution multi-spectral imagery from Landsat 8 [4], and high-resolution building footprint masks from Microsoft Bing [5], formatted to accelerate machine learning research into land cover mapping. The Chesapeake Conservancy spent over 10 months and $1.3 million creating a consistent six-class land cover dataset covering the Chesapeake Bay watershed. While the purpose of the mapping effort by the Chesapeake Conservancy was to create land cover data to be used in conservation efforts, the same data can be used to train machine learning models that can be applied over even wider areas.

The organization of this dataset (detailed below) lets users directly test questions of geographic generalization, i.e. how to train machine learning models on one region that remain accurate when applied to other regions. For example, this dataset can be used to estimate how well a model trained on data from Maryland generalizes to the remainder of the Chesapeake Bay watershed.

Python code for training and testing deep learning models (Keras/TensorFlow based) can be found in the accompanying GitHub repository:

https://github.com/calebrob6/land-cover

Further developments in models and related tools can be found at:

https://github.com/Microsoft/landcover

Papers using a superset of this data include [6, 7]. Paper [8] uses data from the same sources.

### Dataset organization
#### Tiles
At the highest level this dataset is organized by tiles. A tile is a spatial area measuring roughly **6km x 7.5km** (with definitions that roughly match up with USGS quarter quadrangles). Each tile comes with seven corresponding GeoTIFFs:

- NAIP 2013/2014 imagery ("_naip-new.tif" suffix)
- NAIP 2011/2012 imagery ("_naip-old.tif" suffix)
- Chesapeake Conservancy land cover labels ("_lc.tif" suffix)
- NLCD 2011 labels ("_nlcd.tif" suffix)
- Landsat 8 leaf-on composite ("_landsat-leaf-on.tif" suffix)
- Landsat 8 leaf-off composite ("_landsat-leaf-off.tif" suffix)
- Building footprint mask ("_buildings.tif" suffix)
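As a small illustration of this naming scheme, the sketch below builds the seven file paths for a single tile from the suffixes listed above. The helper `tile_paths`, the short layer names, and the tile prefix `example_tile` are hypothetical; real tile prefixes should be taken from the split CSV files.

```python
from pathlib import Path

# Suffixes as listed above; the dictionary keys are hypothetical short names.
TILE_SUFFIXES = {
    "naip_new": "_naip-new.tif",
    "naip_old": "_naip-old.tif",
    "lc": "_lc.tif",
    "nlcd": "_nlcd.tif",
    "landsat_leaf_on": "_landsat-leaf-on.tif",
    "landsat_leaf_off": "_landsat-leaf-off.tif",
    "buildings": "_buildings.tif",
}


def tile_paths(root, tile_prefix):
    """Return {layer_name: Path} for the seven GeoTIFFs of one tile."""
    return {
        name: Path(root) / f"{tile_prefix}{suffix}"
        for name, suffix in TILE_SUFFIXES.items()
    }


# "data" and "example_tile" are placeholders, not real dataset paths.
paths = tile_paths("data", "example_tile")
for name, path in paths.items():
    print(name, path)
```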

These GeoTIFFs are all aligned and at a **1m spatial resolution**. Here, the low-resolution NLCD labels (natively at a 30m spatial resolution) have been reprojected to 1m with nearest-neighbor upsampling, while the NAIP and high-resolution land cover labels are natively aligned at 1m.
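To make the effect of nearest-neighbor upsampling concrete, here is a minimal NumPy sketch that repeats each 30m label cell 30 times along both axes. It assumes an integer scale factor and an already aligned grid; the actual reprojection used to build the dataset is not described here and may differ. The function name and the example class codes are illustrative only.

```python
import numpy as np


def upsample_nearest(labels, factor):
    """Repeat every pixel `factor` times along both axes (nearest-neighbor)."""
    return np.repeat(np.repeat(labels, factor, axis=0), factor, axis=1)


# A 2x2 grid of example class codes at 30m resolution.
coarse = np.array([[11, 41], [82, 21]], dtype=np.uint8)

# Each 30m cell becomes a 30x30 block of identical 1m pixels.
fine = upsample_nearest(coarse, 30)
print(fine.shape)
```

Because values are copied rather than interpolated, no new class codes are introduced, which is why this kind of resampling suits categorical labels.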

The Landsat 8 leaf-on and leaf-off composites are created from the median of the non-cloudy T1 surface reflectance pixels between April 1 and September 30 (leaf-on) and between October 1 and March 31 (leaf-off) in the years 2013-2017. The final composites are **upsampled to 1m spatial resolution**. Finally, the building footprints have been rasterized to a 1m resolution from their native polygon format, also with nearest-neighbor sampling.
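The per-pixel median over non-cloudy observations can be sketched as follows. This is only an illustration of the idea, assuming a stack of co-registered scenes and a boolean cloud mask of the same shape; the function name and array layout are hypothetical and this is not the pipeline that produced the composites.

```python
import numpy as np


def median_composite(stack, cloudy):
    """Per-pixel median over time, ignoring cloudy observations.

    stack:  (T, H, W) reflectance values for T scenes.
    cloudy: (T, H, W) boolean mask, True where a pixel is cloudy.
    """
    masked = np.where(cloudy, np.nan, stack.astype(np.float32))
    return np.nanmedian(masked, axis=0)


# Three scenes over a 1x2 pixel area; the second scene is cloudy at pixel (0, 0).
stack = np.array([[[0.1, 0.2]], [[0.9, 0.4]], [[0.3, 0.6]]], dtype=np.float32)
cloudy = np.array([[[False, False]], [[True, False]], [[False, False]]])

composite = median_composite(stack, cloudy)
print(composite)
```

A pixel that is cloudy in every scene comes out as NaN (NumPy also emits a warning), so a real pipeline would need a policy for such gaps.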

There are **732 total tiles**, 125 sampled uniformly from each of the following (state, year) pairs:

- Delaware 2013 (only 107 tiles)
- New York 2013
- Maryland 2013
- Pennsylvania 2013
- West Virginia 2014
- Virginia 2014

The ~125 tiles from each (state, year) pair are further split into 100 "train tiles" (except for Delaware, which has 82 train tiles), 5 "validation tiles", and 20 "test tiles". The filenames for each split are listed in the accompanying CSV files. For example, the filenames associated with the 125 tiles from West Virginia can be found in the following CSVs:

- wv_1m_2014_train_tiles.csv
- wv_1m_2014_val_tiles.csv
- wv_1m_2014_test_tiles.csv
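The split sizes above can be checked against the stated total with a few lines of arithmetic. The dictionary below simply restates the numbers from this section; its name and layout are illustrative.

```python
# Tiles per split for each (state, year) pair, as stated above.
SPLITS = {
    "Delaware 2013": {"train": 82, "val": 5, "test": 20},
    "New York 2013": {"train": 100, "val": 5, "test": 20},
    "Maryland 2013": {"train": 100, "val": 5, "test": 20},
    "Pennsylvania 2013": {"train": 100, "val": 5, "test": 20},
    "West Virginia 2014": {"train": 100, "val": 5, "test": 20},
    "Virginia 2014": {"train": 100, "val": 5, "test": 20},
}

tiles_per_pair = {pair: sum(split.values()) for pair, split in SPLITS.items()}
total_tiles = sum(tiles_per_pair.values())

for pair, count in tiles_per_pair.items():
    print(f"{pair}: {count} tiles")
print("total:", total_tiles)
```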

#### Patches

This dataset also includes 500 pre-generated patches from each training and validation tile. Here, a patch is defined as a random 256×256 (meter) crop from the tile’s extent, i.e. 256×256 pixels at the 1m resolution. For a (state, year) pair with 100 train tiles and 5 validation tiles, this results in 500 × 100 = 50,000 training patches and 500 × 5 = 2,500 validation patches (by the same rule, Delaware's 82 train tiles would give 41,000 training patches). The filenames for each patch (and accompanying metadata) are also listed in a CSV for each (state, year). For example, the training and validation patches for West Virginia can be found in the following CSVs:

- wv_1m_2014_train_patches.csv
- wv_1m_2014_val_patches.csv
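For readers who want to draw additional patches from a tile, a random 256×256 crop can be sketched as below. It assumes the tile is already loaded as a (channels, height, width) array and that crop positions are drawn uniformly; how the pre-generated patches were actually sampled is not specified here, and the function name is hypothetical.

```python
import numpy as np

PATCH_SIZE = 256  # pixels, i.e. meters at 1m resolution


def random_patch(tile, rng):
    """Return a (C, 256, 256) crop of a (C, H, W) tile and its (row, col) offset."""
    _, height, width = tile.shape
    if height < PATCH_SIZE or width < PATCH_SIZE:
        raise ValueError("tile is smaller than the patch size")
    # The upper bound of rng.integers is exclusive, so add 1.
    row = int(rng.integers(0, height - PATCH_SIZE + 1))
    col = int(rng.integers(0, width - PATCH_SIZE + 1))
    return tile[:, row:row + PATCH_SIZE, col:col + PATCH_SIZE], (row, col)


rng = np.random.default_rng(0)
tile = np.zeros((4, 1000, 1200), dtype=np.uint8)  # stand-in for a real tile
patch, offset = random_patch(tile, rng)
print(patch.shape, offset)
```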

Furthermore, the spatial extent of each patch can be found in a similarly named GeoJSON file:

- wv_1m_2014_train_patches.geojson
- wv_1m_2014_val_patches.geojson

Each shape in the GeoJSON has a patch_id key that can be matched back to a row in the CSV mentioned above.
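Matching shapes to CSV rows could look like the sketch below, which uses only the standard library. It assumes that patch_id is stored under each feature's "properties" and that the CSV has a column with the same name; both are assumptions, as are the `filename` column and the in-memory example data, so check the real files before relying on this.

```python
import csv
import io
import json


def join_patches(csv_text, geojson_text):
    """Map patch_id -> (csv_row, geometry) for ids present in both inputs."""
    rows = {row["patch_id"]: row for row in csv.DictReader(io.StringIO(csv_text))}
    joined = {}
    for feature in json.loads(geojson_text)["features"]:
        # Assumed location of the key; compare ids as strings.
        patch_id = str(feature["properties"]["patch_id"])
        if patch_id in rows:
            joined[patch_id] = (rows[patch_id], feature["geometry"])
    return joined


# Hypothetical stand-ins for the real CSV and GeoJSON contents.
csv_text = "patch_id,filename\n0,patch_0\n1,patch_1\n"
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"patch_id": 1},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[0, 0], [256, 0], [256, 256], [0, 256], [0, 0]]],
        },
    }],
})

joined = join_patches(csv_text, geojson_text)
print(joined["1"][0]["filename"], joined["1"][1]["type"])
```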

Each patch is a (29x256x256) tensor. The channels are described as follows:

- Channels 1-4 contain the R, G, B, and NIR bands respectively of the NAIP "new" imagery (from 2013/2014 NAIP). Values are uint8s (i.e. in the range [0, 255]).
- Channels 5-8 contain the R, G, B, and NIR bands respectively of the NAIP "old" imagery (from 2011/2012 NAIP).
- Channel 9 contains the high-resolution land cover labels:
    - 1 = water
    - 2 = tree canopy / forest
    - 3 = low vegetation / field
    - 4 = barren land
    - 5 = impervious (other)
    - 6 = impervious (road)
    - 15 = no data
- Channel 10 contains the low-resolution NLCD labels. Values follow the class codes of the NLCD 2011 dataset [3]. The values 0 and 255 indicate that no data is available.
- Channels 11-19 contain the 9 bands of our Landsat 8 surface reflectance leaf-on imagery. Band names follow the Landsat 8 product [4]. We take B1, B2, B3, B4, B5, B6, B7, B10, B11. Values are float32.
- Channels 20-28 contain the 9 bands of our Landsat 8 surface reflectance leaf-off imagery.
- Channel 29 contains a building footprint mask generated from Bing Building Footprints.
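The channel layout above translates into array slices as in the following sketch. The channel numbers in the list are 1-based, so the slices here are shifted to 0-based indexing. The layer names, the helper `split_patch`, and the zero-filled example patch are illustrative; how patches are stored on disk is not covered here.

```python
import numpy as np

# 0-based slices for the 1-based channel ranges listed above.
CHANNELS = {
    "naip_new": slice(0, 4),            # channels 1-4
    "naip_old": slice(4, 8),            # channels 5-8
    "lc": slice(8, 9),                  # channel 9
    "nlcd": slice(9, 10),               # channel 10
    "landsat_leaf_on": slice(10, 19),   # channels 11-19
    "landsat_leaf_off": slice(19, 28),  # channels 20-28
    "buildings": slice(28, 29),         # channel 29
}
LC_NODATA = 15  # "no data" value of the high-resolution labels


def split_patch(patch):
    """Split a (29, H, W) patch into a dict of named layers."""
    if patch.shape[0] != 29:
        raise ValueError(f"expected 29 channels, got {patch.shape[0]}")
    return {name: patch[channels] for name, channels in CHANNELS.items()}


# Stand-in patch whose land cover labels are entirely "no data".
patch = np.zeros((29, 256, 256), dtype=np.float32)
patch[8] = LC_NODATA

layers = split_patch(patch)
valid = layers["lc"][0] != LC_NODATA  # True where a label is available
print({name: layer.shape for name, layer in layers.items()})
```

Masking with `valid` before computing a loss or accuracy keeps pixels without labels from influencing the result.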

#### Download links

- Delaware (65GB)
- New York (76GB)
- Maryland (79GB)
- Pennsylvania (82GB)
- Virginia (83GB)
- West Virginia (85GB)

#### Contact

For questions about this dataset, contact calebrob6+lcmcvpr2019@gmail.com.