# AVA Dataset

## 1. Action Recognition

- Dataset: Moments_in_Time_Raw  
- Dataset link: http://moments.csail.mit.edu/splits/Moments_in_Time_Raw_v2.zip  
- Paper: *Moments in Time Dataset: one million videos for event understanding*  
- How to process:
  1. Run `action/analyze.py` to analyze class distributions and extract frames.
  2. Run `action/final_json.py` to generate formatted training and validation JSONs.

## 2. Counting

### CarPK Dataset
- Paper: *Drone-based Object Counting by Spatially Regularized Regional Proposal Network*
- How to process:
  1. Run `count/car/dataset_download.py` to download the dataset.
  2. Run `count/car/code.py` to generate formatted training and validation JSONs.

### Crowd Surveillance
- Dataset link: https://aistudio.baidu.com/datasetdetail/177162/0
- How to process:
  - Run `count/crowd/code.py` to generate formatted training and validation JSONs.

### FSC (Few-Shot Counting)
- Paper: *Learning To Count Everything*
- How to process:
  1. Run `count/FSC/data_download.py` to download the dataset.
  2. Run `count/FSC/data_check.py` to generate formatted training and validation JSONs.

### VQAv2 (Visual Question Answering)
- Paper: *VQA: Visual Question Answering*
- How to process:
  1. Run `count/vqav2/data_download.py` to download the dataset.
  2. Run `count/vqav2/threashold_out.py` to analyze and filter the data.
  3. Run `count/vqav2/threshold_with_csv.py` to generate formatted training and validation JSONs.

### LVIS (Large Vocabulary Instance Segmentation)
- Paper: *LVIS: A Dataset for Large Vocabulary Instance Segmentation*
- How to process:
  1. Run `count/LVIS/data_download.py` to download the dataset.
  2. Run `count/LVIS/analyze.py` to generate formatted training and validation JSONs.

## 3. Fine-Grained Classification

### Aircraft
- Paper: *Fine-Grained Visual Classification of Aircraft*
- How to process:
  1. Run `fine_grained/aircraft/data_download.py` to download the dataset.
  2. Run `fine_grained/aircraft/aircraft.py` to generate formatted training and validation JSONs.

### Birds
- Paper: *The Caltech-UCSD Birds-200-2011 Dataset*
- How to process:
  1. Run `fine_grained/CUB/data_download.py` to download the dataset.
  2. Run `fine_grained/CUB/CUB.py` to generate formatted training and validation JSONs.

### iNaturalist Subsets
- Paper: *Benchmarking Representation Learning for Natural World Image Collections*
- Dataset links:
  - https://ml-inat-competition-datasets.s3.amazonaws.com/2021/train.tar.gz
  - https://ml-inat-competition-datasets.s3.amazonaws.com/2021/train.json.tar.gz
  - https://ml-inat-competition-datasets.s3.amazonaws.com/2021/val.tar.gz
  - https://ml-inat-competition-datasets.s3.amazonaws.com/2021/val.json.tar.gz
- Subsets used: Animal, Fungi, Plantae
- How to process:
  - Animal: `animal.py`
  - Fungi: `fungi.py`
  - Plantae: `plantae.py`
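
The README does not say how the three subset scripts select their data. One plausible approach, sketched below, filters the COCO-style annotation file by kingdom. It assumes each category entry carries a `kingdom` field; the `SUBSET_TO_KINGDOM` mapping and the `select_subset` helper are hypothetical names used only for illustration.

```python
# Assumed mapping from the subset names above to `kingdom` values in the
# category metadata (an assumption, not taken from the repository scripts).
SUBSET_TO_KINGDOM = {"Animal": "Animalia", "Fungi": "Fungi", "Plantae": "Plantae"}


def select_subset(coco, subset):
    """Keep only the categories, annotations, and images of one kingdom."""
    kingdom = SUBSET_TO_KINGDOM[subset]
    categories = [c for c in coco["categories"] if c.get("kingdom") == kingdom]
    keep_ids = {c["id"] for c in categories}
    annotations = [a for a in coco["annotations"] if a["category_id"] in keep_ids]
    image_ids = {a["image_id"] for a in annotations}
    images = [im for im in coco["images"] if im["id"] in image_ids]
    return {"categories": categories, "annotations": annotations, "images": images}


# Tiny in-memory stand-in for train.json; a real file would be read with json.load.
toy = {
    "categories": [
        {"id": 0, "name": "Amanita muscaria", "kingdom": "Fungi"},
        {"id": 1, "name": "Quercus alba", "kingdom": "Plantae"},
    ],
    "images": [{"id": 10, "file_name": "a.jpg"}, {"id": 11, "file_name": "b.jpg"}],
    "annotations": [
        {"image_id": 10, "category_id": 1},
        {"image_id": 11, "category_id": 0},
    ],
}

fungi = select_subset(toy, "Fungi")
print([im["file_name"] for im in fungi["images"]])
```

Filtering on category metadata rather than on folder names keeps the image, annotation, and category lists consistent with each other.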

## 4. Localization

### DIOR
- Paper: *Object detection in optical remote sensing images: A survey and a new benchmark*
- How to process:
  1. Run `localization/DIOR/data_download.py` to download the dataset.
  2. Run `localization/DIOR/analyze.py` to analyze class distributions and extract frames.
  3. Run `localization/DIOR/final_json.py` to generate formatted training and validation JSONs.

### LVIS
- Paper: *LVIS: A Dataset for Large Vocabulary Instance Segmentation*
- How to process (the data is downloaded by the scripts in `count/LVIS`):
  1. Run `localization/LVIS/threshold_1.py` to analyze and filter the data.
  2. Run `localization/LVIS/final_json.py` to filter the data.

### Object365
- Paper: *Objects365: A Large-Scale, High-Quality Dataset for Object Detection*
- How to process:
  1. Run `localization/Object365/data_download.sh` to download the dataset.
  2. Run `localization/Object365/analyze.py` to analyze class distributions and extract frames.
  3. Run `localization/Object365/final_json.py` to generate formatted training and validation JSONs.

### iNaturalist Mammal and Aves
- Paper: *The iNaturalist Species Classification and Detection Dataset*
- How to process:
  1. Run `localization/Inat/data_download.py` to download the dataset.
  2. Run `localization/Inat/analyze_ave.py` to analyze Aves class distributions and extract frames.
  3. Run `localization/Inat/analyze_mammalia.py` to analyze Mammalia class distributions and extract frames.
  4. Run `final_json_mal.py` to generate formatted training and validation JSONs for Mammalia.
  5. Run `final_json_aves.py` to generate formatted training and validation JSONs for Aves.


## 5. OCR (Optical Character Recognition)

### COCO-Text
- Paper: *COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images*
- How to process:
  1. Run `OCR/coco_text/data_download.py` to download the dataset.
  2. Run `OCR/coco_text/analyze.py` to generate formatted training and validation JSONs.

### IIIT-5K
- Paper: *Scene Text Recognition using Higher Order Language Priors*
- How to process:
  1. Run `OCR/IIIT/data_download.py` to download the dataset.
  2. Run `OCR/IIIT/analyze.py` to generate formatted training and validation JSONs.

### TextVQA
- Paper: *Towards VQA Models That Can Read*
- How to process:
  1. Run `data_download.py` to download the dataset.
  2. Run `code_filtering_check.py` to:
     - Visualize the top-scoring OCR bounding box (by area) for 40 randomly sampled TextVQA images.
     - Find and visualize the 10 smallest OCR bounding boxes (area > 100) across all eligible TextVQA images.
  3. Run `run_bbox_filter.py` to:
     - For each area threshold (500, 800, 1000), visualize the 10 smallest OCR bounding boxes larger than the threshold.
     - Visualize the 30 smallest OCR bounding boxes that both have area > 2000 and match an answer word from the TextVQA annotations.
  4. Run `analyze.py` to generate formatted training and validation JSONs.
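
The threshold inspection described for `run_bbox_filter.py` can be illustrated with a short sketch. It is not the script itself: the box representation (dicts with pixel `width` and `height`) and the helper names are assumptions made for the example.

```python
import heapq


def bbox_area(box):
    """Area of a box given as a dict with `width` and `height` (assumed layout)."""
    return box["width"] * box["height"]


def smallest_boxes_above(boxes, min_area, k=10):
    """Return the k smallest boxes whose area is strictly greater than min_area."""
    eligible = [b for b in boxes if bbox_area(b) > min_area]
    return heapq.nsmallest(k, eligible, key=bbox_area)


# Synthetic boxes with areas 10, 20, ..., 2000 standing in for OCR detections.
boxes = [{"width": w, "height": 10} for w in range(1, 201)]

for threshold in (500, 800, 1000):
    picked = smallest_boxes_above(boxes, threshold)
    print(threshold, [bbox_area(b) for b in picked])
```

Looking at the smallest boxes that survive each threshold is a quick way to judge whether the text in them is still legible before fixing a cutoff.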

## 6. Recognition

- Datasets used (same as localization): DIOR, LVIS, Object365, iNaturalist Mammal and Aves
- How to process:
  1. Run `analyze.py` to:
     - Automatically remove background padding from images using the processor's average RGB mean.
     - Visualize one validation sample and print the associated question and answer.
     - Visualize one training sample with its bounding box, question, and answer.
  2. Run `transfer_loc_rec.py` to convert results to recognition format.

## 7. Color Understanding

- Datasets used: LVIS, Object365
- How to process:
  1. Run `color/analyze_LVIS.py` and `color/analyze_365.py`.
  2. Run `color/final_json.py` to unify the output.


## 8. Orientation

### EgoOrientBench
- Description: Statistics + filtering for egocentric orientation estimation.
- How to process:
  1. Go to `Orientation/EgoOrientBench/code/` and run your preprocessing scripts.
  2. You may also use `Orientation/EgoOrientBench/codeV2/` for updated filters.

### Cure_or
- Description: Orientation prediction using Cure_or dataset.
- How to process:
  1. Run `Orientation/Cure_or/code/get_metadata.py` to prepare metadata.
  2. Run `Orientation/Cure_or/code/test.py` to evaluate model or generate results.

### Orientation
- Description: Orientation dataset preprocessing and JSON generation.
- How to process:
  1. Run scripts in `Orientation/Orientation/code/` or `Orientation/Orientation/codev2/` to prepare training and validation data.


## 9. Scene

### place365
- Description: Scene recognition using Places365 dataset.
- How to process:
  - Use code in `Scene/place365/code/` to format the dataset.

### AID
- Description: Aerial Image Dataset for scene classification.
- How to process:
  - Use scripts in `Scene/AID/code/` to prepare data for training.

### Places_merge
- Description: Merged places-based dataset for extended scene understanding.
- How to process:
  - Scripts located in `Scene/Places_merge/code/`.


## 10. Texture

### Texture_tmp
- Description: Texture classification experiments (multiple versions).
- How to process:
  - Available versions:
    1. `Texture/Texture_tmp/code/`
    2. `Texture/Texture_tmp/codev2/`
    3. `Texture/Texture_tmp/codev3/`
    4. `Texture/Texture_tmp/codev4/`



## 11. External Absolute Depth

### Custom Alignment-Based Dataset
- Description: Filters and analyzes synthetic depth estimation images based on object alignment and spacing.
- How to process:
  1. Ensure the bounding box annotation file `label.json` is in the working directory.
  2. Run `absolute_depth/analysis.py` to:
     - Filter valid images where objects are sufficiently spaced.
     - Count valid images per object number.
     - Generate and save a histogram as `valid_image_counts_histogram.png`.
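
As a rough illustration of the spacing filter and the per-object-count tally, the sketch below treats `label.json` as a mapping from image name to a list of `(x_min, y_min, x_max, y_max)` boxes and calls an image valid when every pair of boxes is at least `min_gap` pixels apart. Both the file layout and the spacing criterion are assumptions; `absolute_depth/analysis.py` may define them differently.

```python
from collections import Counter
from itertools import combinations
from math import hypot


def box_gap(a, b):
    """Shortest distance between two boxes; 0 if they touch or overlap."""
    dx = max(0.0, a[0] - b[2], b[0] - a[2])
    dy = max(0.0, a[1] - b[3], b[1] - a[3])
    return hypot(dx, dy)


def is_valid(boxes, min_gap):
    """True when every pair of boxes is at least min_gap apart."""
    return all(box_gap(a, b) >= min_gap for a, b in combinations(boxes, 2))


def count_valid_by_object_number(labels, min_gap=20.0):
    """Count valid images, keyed by how many objects each one contains."""
    return Counter(len(b) for b in labels.values() if is_valid(b, min_gap))


# Hypothetical labels: image name -> list of (x_min, y_min, x_max, y_max).
labels = {
    "a.png": [(0, 0, 10, 10), (50, 0, 60, 10)],                    # gap 40
    "b.png": [(0, 0, 10, 10), (15, 0, 25, 10)],                    # gap 5
    "c.png": [(0, 0, 10, 10), (40, 0, 50, 10), (0, 40, 10, 50)],   # gaps >= 30
}
counts = count_valid_by_object_number(labels)
print(sorted(counts.items()))
```

The resulting counter is the kind of data a histogram such as `valid_image_counts_histogram.png` would be drawn from.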


## 12. External Relative Depth

### KITTI Relative Depth Dataset (Pairwise Comparison)
- Description: Constructs a relative depth estimation dataset from KITTI by generating object pairs with meaningful depth differences, creating cropped image pairs, and formatting them for training and evaluation.

- How to process:
  1. **Pair Filtering**  
     Run `relative_depth_json.py` to:
     - Filter valid object pairs with large enough depth differences.
     - Save results to `filtered_label_with_pairs.json`.
     - Generate histograms: `pair_counts.png`, `depth_difference_histogram.png`, and `object_counts_in_pairs.png`.

  2. **Image Cropping and JSON Generation**  
     Run `relative_depth_image.py` to:
     - Read `filtered_label_with_pairs.json` and KITTI images.
     - Create cropped images with visual bounding boxes for object pairs.
     - Format and save:
       - `train/train.json`
       - `val/val.json`
       - `val/val_ans.json`

  3. **Answer Format Normalization**  
     Run `json_ans_fix.py` to:
     - Convert "red" to "1. red" and "blue" to "2. blue" in `train.json` and `val_ans.json`.

  4. **Video Visualization**  
     Run `relative_depth_video.py` (or `create_video_from_json`) to:
     - Generate a video visualizing cropped image pairs and their depth differences.
     - Overlay bounding boxes and text on each frame.
     - Save to `output_video.mp4`.
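
The answer normalization in step 3 is a simple string mapping. A minimal sketch follows, assuming each record stores its answer under an `answer` key (the real key name in `train.json` and `val_ans.json` may differ); `normalize_answers` is a hypothetical helper, not a function from `json_ans_fix.py`.

```python
# Mapping stated in step 3: bare color names become numbered options.
ANSWER_MAP = {"red": "1. red", "blue": "2. blue"}


def normalize_answers(records, key="answer"):
    """Return copies of the records with the answer field normalized.

    Values that are already normalized, or that are not in the map,
    are left unchanged, so running the function twice is harmless.
    """
    return [{**r, key: ANSWER_MAP.get(r[key], r[key])} for r in records]


records = [
    {"id": 0, "answer": "red"},
    {"id": 1, "answer": "blue"},
    {"id": 2, "answer": "1. red"},
]
fixed = normalize_answers(records)
print([r["answer"] for r in fixed])
```

Returning new dictionaries instead of editing in place keeps the original records available for comparison after the fix.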

## 13. Emotion

### Facial Emotion Recognition Dataset (EXPW + RAF-DB)
- Description: Combines images from EXPW and RAF-DB datasets, normalizes emotion classes, and formats them into question-answer format with red bounding box annotations.

---

### How to process:

#### 1. **Prepare Labeled JSON from EXPW and RAF-DB**
Run `prepare_json.py` to:
- Parse EXPW annotations from `label.lst` with bounding boxes.
- Parse RAF-DB annotations with emotion folders and bounding box text files.
- Merge and save output batches to `processed_data.json`.

#### 2. **Generate Training/Validation Images and JSONs**
Run `prepare_data.py` to:
- Draw red bounding boxes for emotion regions.
- Split data into 70% training and 30% validation.
- Save:
  - `train/train.json`
  - `val/val.json`
  - `val/val_ans.json`
  - Corresponding cropped images into `train/images/` and `val/images/`.
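
The 70/30 split can be made reproducible with a seeded shuffle. This is a sketch under assumptions: the README does not say whether `prepare_data.py` shuffles, stratifies by emotion, or fixes a seed, and `split_train_val` is a hypothetical helper.

```python
import random


def split_train_val(samples, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed, then cut into train and validation lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, val = split_train_val(range(100))
print(len(train), len(val))
```

Using a dedicated `random.Random(seed)` instance avoids touching the global random state, so the same split is produced on every run regardless of what other code does.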

#### 3. **Read Emotion Distribution Statistics**
Run `temp.py` to:
- Print number of samples per emotion class (aggregated from `processed_data.json`).

#### 4. **Create Emotion Visualization Videos**
Run `video.py` to:
- Select 10 images per emotion (5 from `expw`, 5 from `rafdb`).
- Draw red (EXPW) and blue (RAFDB) bounding boxes.
- Create one `.mp4` video per emotion in `videos/`.


## 14. Internal Relative Depth

### NYUv2 Pairwise Object Depth Comparison
- Description: Generates question-answer pairs for internal relative depth estimation by selecting object pairs from NYUv2 dataset based on label agreement, depth difference, and spatial separation. The final output is a set of balanced QA samples for learning spatial depth comparisons across categories.

### How to process:

1. **Generate Pair Statistics**
   - Run `analysis.py` to:
     - Group questions by (object1, object2) label pairs.
     - Compute total question count, answer bias, and number of unique images.
     - Save summary to `pairwise_label_summary.csv`.

2. **Filter QA Pairs with Depth and BBox Constraints**
   - Run either `filter_chunk.py` (low-memory) or `filter.py` (preload version) to:
     - Apply depth difference threshold (≥ 0.5m).
     - Remove overlapping/invalid bounding boxes.
     - Keep only label pairs with at least 10 examples per direction.
     - Downsample to max 30 per direction and 5 pairs per label.
     - Save to `final_selected_questions.csv`.
     - Plot: `afterFiltering_summary.jpg`.
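
The depth and balance constraints in step 2 can be sketched as follows. The row layout (`label_a`, `label_b`, `closer`, `depth_diff`) and the function name are assumptions, and the sketch covers only the depth threshold, the minimum of 10 examples per direction, and the cap of 30 per direction. It does not implement the bounding-box checks or the limit of 5 pairs per label.

```python
import random
from collections import defaultdict


def filter_pairs(rows, min_diff=0.5, min_per_dir=10, max_per_dir=30, seed=0):
    """Keep balanced label pairs whose depth difference is large enough."""
    rng = random.Random(seed)
    # (label_a, label_b) -> direction ("a" or "b" is closer) -> rows
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        if abs(row["depth_diff"]) >= min_diff:
            groups[(row["label_a"], row["label_b"])][row["closer"]].append(row)

    selected = []
    for by_dir in groups.values():
        # Require enough examples in both directions to limit answer bias.
        if all(len(by_dir.get(d, [])) >= min_per_dir for d in ("a", "b")):
            for d in ("a", "b"):
                items = by_dir[d]
                if len(items) > max_per_dir:
                    items = rng.sample(items, max_per_dir)
                selected.extend(items)
    return selected


def make_rows(a, b, closer, diff, n):
    """Build n identical toy rows for the demo."""
    return [{"label_a": a, "label_b": b, "closer": closer, "depth_diff": diff}
            for _ in range(n)]


rows = (make_rows("chair", "table", "a", 1.0, 40)
        + make_rows("chair", "table", "b", 0.8, 12)
        + make_rows("chair", "table", "b", 0.2, 50)   # below the threshold
        + make_rows("bed", "lamp", "a", 1.0, 20)
        + make_rows("bed", "lamp", "b", 1.0, 3))      # too few in one direction
print(len(filter_pairs(rows)))
```

Requiring a minimum count in both directions before downsampling is what keeps a model from answering correctly by always picking the same object of a pair.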


## 15. Spatial

### Spatial Reasoning QA Data Preparation
- Description: Provides conversion, validation, and generation utilities for spatial datasets including NYUv2, LVIS, Object365, and custom selections. Supports bounding box processing, label merging, histogram analysis, and random sampling.

### How to process:

1. **Convert Original Datasets**
- `1convert_train.py`, `1convert_val.py`:
  Convert raw train/val to unified format.
- `1convert_Train_bbox.py`, `1convert_Val_bbox.py`:
  Add bounding box annotations.

2. **Label Checking and Combination**
- `checkLabel.py`: Check for valid labels.
- `check_still_unique.py`: Ensure object uniqueness.
- `combineLabel.py`: Merge multiple label sources.

3. **Dataset Generation**
- `nyu_depth_generation.py`, `nyu_depth_test.py`: NYUv2-based spatial QA generation.
- `lvis_generation.py`, `lvis_test.py`: LVIS spatial reasoning QA.
- `object365_generation.py`, `object365_test.py`: Object365 spatial QA generation.

4. **Filtering & Sampling**
- `random_50.py`, `ran50_test.py`: Sample 50 random object types.
- `shuffle.py`: Shuffle QA samples.
- `split.py`: Train/val/test split.
- `unique_count.py`: Count unique QA targets.

5. **Analysis**
- `analyze.py`: Question/object distribution.
- `histogram_all.py`: Combined dataset histograms.
- `min_max.py`: Bounding box spread analysis.


## 16. Internal Absolute Depth

### Absolute Depth Estimation using NYU-Depth (Bin-Based)
- Description: Generates absolute depth questions for object instances in the NYU-Depth v2 dataset. After filtering based on depth distribution and bounding box area, objects are framed as questions asking for real-world depth estimation.

### How to process:

1. **Object Detection and Metadata Extraction**
   - Run `analysis.py` to:
     - Draw red bounding boxes on all valid object instances.
     - Save images to `image_bbox/` and metadata to `bbox_metadata_all.csv`
     - Total object samples: 28,100

2. **Filter Objects Based on Depth and BBox Properties**
   - Run `filter.py` to:
     - Step 1: Keep labels with ≥ 3 depth bins (resized to 384×384)
     - Step 2: Remove bins with < min_img, cap bins with > max_img using bbox area
     - Save results:
       - `bbox_metadata_filtered.csv`
       - `afterFiltering_summary.png`
   - Summary:
     - Step 1 → Labels remaining: 180; Images: 20,538
     - Step 2 → Labels remaining: 45; Images: 4,441

3. **Question Generation**
   - Run `generateQuestion.py` to:
     - Generate absolute depth questions per object.
     - Output file: `question_output.csv`

4. **Split QA Dataset**
   - Run `split.py` to:
     - Generate train/val JSONs from filtered questions.
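
The two filtering steps of `filter.py` can be approximated as below. The README gives no values for `min_img` and `max_img` and does not say how bounding-box area decides which images survive the cap; the sketch assumes the largest boxes are kept and uses placeholder thresholds. Row fields and the function name are hypothetical.

```python
from collections import defaultdict


def filter_by_bins(rows, min_bins=3, min_img=5, max_img=50):
    """Filter object rows by depth-bin coverage and per-bin image counts."""
    by_label = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_label[row["label"]][row["depth_bin"]].append(row)

    kept = []
    for bins in by_label.values():
        # Step 1: the label must span at least min_bins distinct depth bins.
        if len(bins) < min_bins:
            continue
        for bin_rows in bins.values():
            # Step 2a: drop bins with too few images.
            if len(bin_rows) < min_img:
                continue
            # Step 2b: cap large bins, keeping the largest boxes (assumption).
            ranked = sorted(bin_rows, key=lambda r: r["bbox_area"], reverse=True)
            kept.extend(ranked[:max_img])
    return kept


rows = (
    [{"label": "chair", "depth_bin": 0, "bbox_area": a} for a in (5, 6)]
    + [{"label": "chair", "depth_bin": 1, "bbox_area": a} for a in (10, 20, 30, 40, 50)]
    + [{"label": "chair", "depth_bin": 2, "bbox_area": 7}]
    + [{"label": "lamp", "depth_bin": b, "bbox_area": 9} for b in (0, 1) for _ in range(5)]
)
kept = filter_by_bins(rows, min_bins=3, min_img=2, max_img=3)
print(len(kept))
```

In the toy data, `lamp` is dropped because it spans only two bins, and the `chair` bin with five images is capped to its three largest boxes.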



@techreport{maji13fine-grained,
   title         = {Fine-Grained Visual Classification of Aircraft},
   author        = {S. Maji and J. Kannala and E. Rahtu
                    and M. Blaschko and A. Vedaldi},
   year          = {2013},
   archivePrefix = {arXiv},
   eprint        = {1306.5151},
   primaryClass  = "cs-cv",
}

@techreport{WahCUB_200_2011,
	Title = {The Caltech-UCSD Birds-200-2011 Dataset},
	Author = {Wah, C. and Branson, S. and Welinder, P. and Perona, P. and Belongie, S.},
	Year = {2011},
	Institution = {California Institute of Technology},
	Number = {CNS-TR-2011-001}
}

@inproceedings{gupta2019lvis,
  title={{LVIS}: A Dataset for Large Vocabulary Instance Segmentation},
  author={Gupta, Agrim and Dollar, Piotr and Girshick, Ross},
  booktitle={Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
  year={2019}
}

@ARTICLE{10056343,
  author={Zhan, Yang and Xiong, Zhitong and Yuan, Yuan},
  journal={IEEE Transactions on Geoscience and Remote Sensing}, 
  title={RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data}, 
  year={2023},
  volume={61},
  number={},
  pages={1-13},
  doi={10.1109/TGRS.2023.3250471}
}

@InProceedings{MishraBMVC12,
  author    = "Mishra, A. and Alahari, K. and Jawahar, C.~V.",
  title     = "Scene Text Recognition using Higher Order Language Priors",
  booktitle = "BMVC",
  year      = "2012",
}

@InProceedings{singh2019textvqa,
  title={Towards VQA models that can read},
  author={Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu
  and Chen, Xinlei and Batra, Dhruv and Parikh, Devi and Rohrbach, Marcus},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2019}
}

@inproceedings{m_Ranjan-etal-CVPR21,
  author = {Viresh Ranjan and Udbhav Sharma and Thu Nguyen and Minh Hoai},
  title = {Learning To Count Everything},
  year = {2021},
  booktitle = {Proceedings of the {IEEE/CVF} Conference on Computer Vision and Pattern Recognition (CVPR)},
}