# Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery

Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. 
Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. 
Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. 
We introduce the first large-scale dataset for this problem, leveraging medium-resolution multi-spectral satellite imagery from Planet Labs. 
Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. 
We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.


<div style="border-width:1px; border-style:solid; border-color:#d2db8c; padding-left: 1em; padding-right: 1em; ">
  
<h2 style="margin-top:5px;">Dataset Related Links</h2>

- **Dataset Folder :** https://www.dropbox.com/scl/fo/t59p9aalqodm36h5vskge/AO_2J1nk5TpELDAH4l4RDE0?rlkey=ksa51mwmqk39q0cfsjn8fal7x&st=oib5sa9t&dl=0

- **Datasheets for Datasets :** https://www.dropbox.com/scl/fi/a3u9ur7e6vn1e7svxqkrl/AWD_Datasheets_for_Dataset.pdf?rlkey=vzronprpcy4jyf9viwhcbc6og&st=6dqjwf3p&dl=0

- **Croissant metadata :** https://www.dropbox.com/scl/fi/ugx8dji1asz8l93l3m66w/croissant-alberta_wells_dataset.json?rlkey=4n5fliwe5rhuyrmddt2c75pjm&st=2zzp74d9&dl=0

- **Dataset License :** https://www.dropbox.com/scl/fi/5dtdhj1kur4fd7rs7lcrw/LICENSE.txt?rlkey=5tphzgwq2wouacn45zl5hqk6f&st=rbqz7uk1&dl=0

- **Dataset Samples Illustration :** https://www.dropbox.com/scl/fo/cjjgl6739ydqawyym2itw/AOqClSoLJuAVAhquBpiMi8g?rlkey=svl3aesacpq5tvv2tpu0dwknc&st=onlyfxvo&dl=0

</div>

## Setup

### Cloning Repository

```bash
# TODO: will be updated once the code is made public
git clone <anonymous>
```

### Setup Conda Environment

```bash
cd <inside the code repository with main.py>
export AWD_CODEBASE=$(pwd)
conda create --name awd python=3.11.7
conda activate awd
cd setup
pip install -r requirements.txt
cd ..
```

### Downloading Dataset

```bash
cd $AWD_CODEBASE
cd downloads
```

#### Validation Set

```bash
wget -O Validation.tar.gz 'https://www.dropbox.com/scl/fi/t7xploy60rb4t32c79qvz/Validation.tar.gz?rlkey=qjsvufl29jxqvg0fw9cykusnt&st=p5hiqo7i&dl=1'

tar -xzvf Validation.tar.gz
```

#### Test Set

```bash
wget -O Test.tar.gz 'https://www.dropbox.com/scl/fi/yezmzt0cx6mv1z3d9kz1p/Test.tar.gz?rlkey=o85irapbufr6qxxa30tsauh8n&st=odj9l3fc&dl=1'

tar -xzvf Test.tar.gz
```

#### Train Set (Will Require Some Time due to Large File Size)

```bash
wget -O Train.tar.gz 'https://www.dropbox.com/scl/fi/90tzje0ndukbccyh7699k/Train.tar.gz?rlkey=qewntwwefku8d31e0i724r8lg&st=5rmwzmwc&dl=1'

tar -xzvf Train.tar.gz
```


### Splitting the Dataset into Smaller Files for Faster Access

```bash
python setup/file_handling_cc_train.py $PWD

python setup/file_handling_cc_eval.py $PWD

python setup/file_handling_cc_test.py $PWD
```

## Benchmark Details

### Binary Well Segmentation
For segmentation, we used the architectures below:

| Model | Paper | 
| :---: | :---: |
|UNet| [UNet Paper](https://arxiv.org/abs/1505.04597)|
|DeepLabv3plus|[DeepLabv3plus Paper](https://arxiv.org/abs/1802.02611)|
|UperNet|[UperNet Paper](https://arxiv.org/abs/1807.10221)|
|SegFormer|[SegFormer Paper](https://arxiv.org/abs/2105.15203)|

We train all CNN models with a ResNet50 backbone, a batch size of 128, and the binary cross-entropy with logits (BCELogits) loss function. A cosine annealing scheduler adjusts the learning rate cyclically.
For transformer models, SegFormer and UperNet both use a Dice loss function and a polynomial learning rate scheduler. SegFormer uses a mit-b0-ade backbone with a batch size of 128, while UperNet uses a Swin Small Transformer with a batch size of 64. All models are optimized with AdamW for 50 epochs.
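
As a rough illustration of the two schedule families mentioned above, the sketch below computes a single cosine annealing cycle and a polynomial decay in plain Python. The base learning rate, minimum learning rate, and polynomial power are hypothetical values chosen for illustration; the settings actually used are in the config files and the main paper. A cyclic variant would simply restart the cosine cycle.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate at `epoch` within one cosine annealing cycle."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


def polynomial_lr(epoch, total_epochs, base_lr, power=1.0):
    """Polynomial decay from base_lr at epoch 0 down to 0 at total_epochs."""
    return base_lr * (1 - epoch / total_epochs) ** power


# 50 epochs matches the text above; the base learning rate of 1e-3 is an assumption.
for epoch in (0, 25, 50):
    print(epoch, cosine_annealing_lr(epoch, 50, 1e-3), polynomial_lr(epoch, 50, 1e-3))
```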

### Object (Well) Detection

For object detection, we used the architectures below:

| Model | Paper | 
| :---: | :---: |
|RetinaNet| [RetinaNet Paper](https://arxiv.org/abs/1708.02002)|
|Faster RCNN|[Faster RCNN Paper](https://arxiv.org/abs/1506.01497)|
|DETR|[DETR Paper](https://arxiv.org/abs/2005.12872)|

All object detection models are trained with a ResNet50 backbone. The batch size is 256 for Faster R-CNN and DETR, and 512 for RetinaNet. Faster R-CNN and RetinaNet use a cosine annealing scheduler, while DETR uses a step-wise scheduler that reduces the learning rate every 50 epochs. We train Faster R-CNN and RetinaNet for 120 epochs and DETR for 150 epochs. All models are optimized using AdamW.
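
The step-wise schedule described for DETR can be sketched as follows. The 50-epoch step and the 150-epoch budget come from the text above; the decay factor `gamma` and the base learning rate are hypothetical placeholders, not values taken from this repository.

```python
def step_lr(epoch, base_lr, step_size=50, gamma=0.1):
    """Learning rate that is multiplied by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Assumed base learning rate of 1e-4; the rate drops at epochs 50 and 100.
for epoch in (0, 49, 50, 100, 149):
    print(epoch, step_lr(epoch, 1e-4))
```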

### Evaluation

We evaluate the binary segmentation task using IoU, Precision, Recall, and F1-Score. Precision penalizes false positives, while Recall penalizes false negatives. IoU measures the overlap between predicted and ground-truth masks, and F1-Score is the harmonic mean of Precision and Recall, so it accounts for both false positives and false negatives.
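
To make these definitions concrete, here is a minimal dependency-free sketch that computes the four metrics from two flattened binary masks. The function names are illustrative and are not part of this codebase; the benchmark code may aggregate the metrics differently (for example, per batch or per image).

```python
def confusion_counts(pred, target):
    """Count true positives, false positives, and false negatives per pixel."""
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t and not p:
            fn += 1
    return tp, fp, fn


def segmentation_metrics(pred, target):
    """Return IoU, precision, recall, and F1 for flattened binary masks."""
    tp, fp, fn = confusion_counts(pred, target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
print(segmentation_metrics(pred, target))
```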

When evaluating binary object (well) detection, we calculate IoU at different thresholds (IoU<sub>0.1</sub>, IoU<sub>0.3</sub>, IoU<sub>0.5</sub>) to gauge how closely predicted well locations match actual ones at varying overlap levels. Additionally, we report Mean Average Precision (mAP) metrics, namely mAP<sub>50</sub> and mAP<sub>50:95</sub>, to capture the model's precision-recall trade-off and detection accuracy across different IoU thresholds.
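
The box overlap underlying these thresholds can be sketched as below for COCO-style `[x, y, width, height]` boxes. This is one plausible reading of the thresholded check, not the benchmark's evaluation code; the predicted box in the example is hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as [x_min, y_min, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def matched_thresholds(iou, thresholds=(0.1, 0.3, 0.5)):
    """Return the thresholds at which a prediction counts as a match."""
    return [t for t in thresholds if iou >= t]


ground_truth = [46, 145, 29, 29]   # box taken from the sample annotation in this README
prediction = [50, 150, 29, 29]     # hypothetical predicted box
iou = box_iou(ground_truth, prediction)
print(iou, matched_thresholds(iou))
```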

## Running Experiments

```bash
TORCH_USE_CUDA_DSA=1 python main.py --config=configs/<location of config file for experiment i.e. either train or inference> --SEED=333
```

To run inference using a specific checkpoint after training, or to continue training from a checkpoint, make the following change:

```
    "checkpoint_file": "False",
```

to

```
    "checkpoint_file": "<relative location of checkpoint in codebase>",
```
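
If you prefer to make this change programmatically, the sketch below shows one way to do it. It assumes the config files are plain JSON with a top-level `checkpoint_file` key, which the snippet above suggests but does not confirm; the function names and any paths passed to them are hypothetical.

```python
import json


def with_checkpoint(config, checkpoint_path):
    """Return a copy of a config dict with `checkpoint_file` set."""
    updated = dict(config)
    updated["checkpoint_file"] = checkpoint_path
    return updated


def set_checkpoint_in_file(config_path, checkpoint_path):
    """Rewrite a JSON config file so that it points at a checkpoint."""
    with open(config_path) as f:
        config = json.load(f)
    config = with_checkpoint(config, checkpoint_path)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```

Calling `set_checkpoint_in_file` with the path of an experiment config and a checkpoint location relative to the codebase would then replace the `"False"` default before `main.py` is run.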


The 'configs' folder contains the experiment configuration for each backbone and task. For details about the configuration used for each architecture and task, refer to the main paper.

The 'scripts' folder contains scripts to run all benchmark experiments for a given backbone configuration.

We use comet_ml to track experiments; tracking can be enabled or disabled by setting 'disabled=True/False' in 'main.py'.



## Context & Data

### Dataset Information

The goal of this dataset is to aid in training deep learning models to identify oil and gas wells, including abandoned, suspended, and active ones. This capability will enable the detection of wells in a specific area and facilitate comparison with government records. If discrepancies are discovered, experts can conduct further investigations, potentially uncovering abandoned or suspended wells that may not be documented in government records.

### Dataset Structure

We provide training, validation, and testing sets, split using our proposed algorithm (as described in Section 3.2 of the main paper) to create a well-distributed dataset. This dataset represents various geographical regions and offers a diverse benchmark for evaluation and testing. Each dataset split is saved as an HDF5 file, structured as described in the following sections, and then compressed into a .tar.gz file for faster transfer. The number of samples in each split and the original and compressed file sizes are presented in the table below:

| Dataset Split | No. of Samples | No. of Wells in Split | Original HDF5 File Size (in GB) | Compressed .tar.gz File Size (in GB) |
|:--------------:|:--------------:|:---------------------:|:-------------------------------:|:-------------------------------------:|
|      Train     |     167436     |         194231        |               322               |                  100                  |
|   Validation   |      9463      |          8243         |                19               |                  5.7                  |
|      Test      |      11789     |         10973         |                24               |                  7.1                  |
|      Total     |     188688     |         213447        |               365               |                 112.8                 |


### Dataset File Directory Structure

The following directory structure is used for each dataset file being stored in a Hierarchical Data Format 5 (HDF5 i.e. a .h5 file):

    <Train/Test/Val>Set.h5
    |---image
       |---<sample_name>
          |---Satellite Image (Multispectral Rasterio Image [.tiff] Data)
          |---Meta Data of <sample_name>
    |---label
       |---binary_seg_maps
          |---<sample_name>
             |---Binary Segmentation Map (Rasterio Image [.jpg] Data)
       |---multi_class_seg_maps
          |---<sample_name>
             |---Multiclass Segmentation Map (Rasterio Image [.jpg] Data)
       |---bounding_box_annotations
          |---<sample_name>
             |---Bounding Box JSON Data (COCO Format)
    |---author:Anonymous Author(s)
    |---description: Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery
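
To check this layout on a downloaded file, a small tree walker such as the one below can help. It is a sketch that assumes `h5py` is installed and that `author` and `description` are stored as root-level HDF5 attributes; neither is confirmed here. The helper names are illustrative, and the file path is whatever `.h5` file you extracted.

```python
def list_leaf_paths(group, prefix=""):
    """Return the '/'-joined path of every leaf (dataset) under a group.

    Works on any nested mapping with an .items() method, which includes
    h5py.File and h5py.Group objects as well as plain dicts.
    """
    paths = []
    for name, item in group.items():
        path = f"{prefix}/{name}"
        if hasattr(item, "items"):  # a sub-group: recurse into it
            paths.extend(list_leaf_paths(item, path))
        else:  # a dataset (leaf node)
            paths.append(path)
    return paths


def print_h5_tree(h5_path):
    """Print all dataset paths and root attributes of an HDF5 file."""
    import h5py  # imported lazily so list_leaf_paths has no dependencies

    with h5py.File(h5_path, "r") as f:
        for path in list_leaf_paths(f):
            print(path)
        for key, value in f.attrs.items():
            print(f"attr {key}: {value}")
```

Because the training file is large, printing every path can take a while; it may be more convenient to start with the validation file.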


### Dataset File (Split) Directory Structure

To improve the efficiency of the data loader, we break the larger .h5 dataset into smaller .h5 files, leading to the following data structure:


    Sample_Splited_File.h5
    |---image
       |---Satellite Image (Multispectral Rasterio Image [.tiff] Data)
       |---Meta Data
    |---label
       |---binary_seg_maps
          |---Binary Segmentation Map (Rasterio Image [.jpg] Data)
       |---multi_class_seg_maps
          |---Multiclass Segmentation Map (Rasterio Image [.jpg] Data)
       |---instance_binary_seg_maps
            |---Instance Segmentation Map (Rasterio Image [.jpg] Data)
       |---bounding_box_annotations
          |---Bounding Box JSON Data (COCO Format)
    |---author:Anonymous Author(s)
    |---description: Alberta Wells Dataset: 
                     Pinpointing Oil and Gas Wells from Satellite Imagery

### Dataset Size & Distribution of Samples

The proposed dataset comprises over 94,000 patches of satellite imagery containing wells, out of a total of more than 188,000 patches sourced from Planet Labs. These patches cover more than 213,000 individual wells. The number of patches, the number of wells, and the file sizes for each split are given in the table in the Dataset Structure section, and the distribution of the number of wells per sample is described in the table below:

| No. of Wells in a Sample | No. of Samples in Training Split                       | Validation Split | Test Split |
|--------------------------|--------------------------------------------------------|------------------|------------|
| 1                        | 44299                                                  | 3393             | 4128       |
| 2 - 3                    | 25378                                                  | 979              | 1242       |
| 4 - 5                    | 7899                                                   | 190              | 328        |
| 6 - 10                   | 4927                                                   | 123              | 227        |
| 11 - 15                  | 751                                                    | 23               | 38         |
| 16 - 25                  | 333                                                    | 11               | 19         |
| 26 - 35                  | 67                                                     | 10               | 2          |
| 36 - 55                  | 45                                                     | 3                | 0          |
| 56 - 75                  | 18                                                     | 0                | 0          |
| 76 - 125                 | 1                                                      | 0                | 0          |
| Total                    | 83718                                                  | 4732             | 5984       |


### PlanetScope Satellite Imagery

For our experiments, we selected a 4-band (RGBN) satellite imagery product (ortho_analytic_4b_sr) from Planet Labs.
This product uses Planet's newest PSB.SD instrument, which features the next-generation “PSBlue” telescope with a larger 47-megapixel sensor. 
The satellite images are corrected for atmospheric conditions and spectral response consistency. These multispectral products are tailored for monitoring in agriculture and forestry, offering precise geolocation and cartographic projection. They are ideal for tasks like land cover classification, with radiometric corrections ensuring accurate data transformation.
The product is designed to be interoperable with Sentinel-2 imagery in several bands. The instrument provides a frame size of 32.5 km x 19.6 km, an image capture capacity of 200 million km²/day, and an imagery bit depth of 12-bit, with a ground sample distance (nadir) ranging from 3.7 m to 4.2 m. The wavelength range of each band is described in the table below.

|      Band of Image     | Wavelength (in nm) of Spectral Band |
|:----------------------:|:----------------------------------:|
|      Band 1 = Blue     |              465 - 515             |
|     Band 2 = Green     |              547 - 585             |
|      Band 3 = Red      |              650 - 680             |
| Band 4 = Near-infrared |              845 - 885             |


### Label Data Description
For our experiments, we create single-channel segmentation maps to locate wells and multi-class segmentation maps for different well states. We also provide COCO format object detection labels for wells. To ensure consistency, we standardize the diameter of a well site to 90 meters when annotating, resulting in a 30-pixel diameter in the labels. The figures below illustrate image patches with their labels and show the spectral bands of an image with bounding box annotations.

Illustration of Sample (1):

![Example Image](resources/spectral/test_7049.png)
![Example Image](resources/labels/test_7049.png)

Illustration of Sample (2):

![Example Image](resources/spectral/train_split_3_2045.png)
![Example Image](resources/labels/train_split_3_2045.png)

Illustration of Sample (3):

![Example Image](resources/spectral/eval_33.png)
![Example Image](resources/labels/eval_33.png)

Sample of Bounding Box Annotation:
```
   [
      {
               'id': 0, 
               'image_id': 'eval_7028', 
               'category_id': 1, 
               'bbox': [46, 145, 29, 29], 
               'iscrowd': 0
         }, 
      {
               'id': 1, 
               'image_id': 'eval_7028', 
               'category_id': 2, 
               'bbox': [45, 127, 29, 29], 
               'iscrowd': 0
         }
   ]
```
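
Since COCO boxes are stored as `[x_min, y_min, width, height]`, a well's approximate pixel location is the box center. The sketch below converts the two sample annotations above into centers and corner coordinates; the helper names are illustrative and not part of this codebase.

```python
def bbox_center(bbox):
    """Center (x, y) of a COCO-style [x_min, y_min, width, height] box."""
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)


def bbox_to_xyxy(bbox):
    """Convert [x_min, y_min, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


# The two annotations shown in the sample above.
annotations = [
    {"id": 0, "image_id": "eval_7028", "category_id": 1, "bbox": [46, 145, 29, 29], "iscrowd": 0},
    {"id": 1, "image_id": "eval_7028", "category_id": 2, "bbox": [45, 127, 29, 29], "iscrowd": 0},
]

for ann in annotations:
    print(ann["id"], bbox_center(ann["bbox"]), bbox_to_xyxy(ann["bbox"]))
```

The corner form is what many detection libraries expect as input, so a conversion like this is often needed before training.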
### Meta Data Description

Each dataset sample is accompanied by metadata, including the sample name (sample ID in string format), the presence of a well in the sample, the number of wells in the sample, and whether a well of a specific category is present in the sample. The table below illustrates the metadata associated with a sample.

| Meta-Data Attribute Name |   Value   |
|:------------------------:|:---------:|
|        Sample_Name       | eval_6934 |
|       wells_present      |    True   |
|        no_of_wells       |     10    |
|  Abandoned_well_present  |    True   |
|    Active_well_present   |    True   |
|  Suspended_well_present  |    True   |

## License

This repository is under the CC-BY-NC 4.0 license. Please refer to [LICENSE](LICENSE.md) for details.

## Acknowledgment

This repository is inspired by [Segmentation Models PyTorch](https://github.com/qubvel/segmentation_models.pytorch/tree/master), [FoMo-Bench](https://github.com/RolnickLab/FoMo-Bench), [DETR](https://github.com/facebookresearch/detr/tree/main) and [PyTorch Project Template](https://github.com/moemen95/Pytorch-Project-Template).

## Citation

```
@inproceedings{awd_dataset,
    author    = {Anonymous Author(s)},
    title     = {Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery},
    booktitle = {Preprint},
    year      = {2024}
}
```