# CAPDELS: The First Astronomical Image Description Dataset

## Dataset Overview

CAPDELS is a dataset of descriptive captions for galaxy images, derived from the Galaxy Zoo CANDELS [1] dataset. It provides multiple LLM-generated captions for each galaxy, describing its morphological features and structure.

## Dataset Structure

The dataset is divided into three JSON files:

1. **train_captions.json**: Contains 480 captions for 160 unique galaxies
2. **validation_captions.json**: Contains 4344 captions for 1448 unique galaxies
3. **test_captions.json**: Contains 1311 captions for 437 unique galaxies

Each entry in the dataset is structured as follows:
```json
{
    "GALAXY_ID": [
        "Caption 1",
        "Caption 2",
        "Caption 3"
    ]
}
```

Where:
- `GALAXY_ID`: A unique identifier for each galaxy (e.g., "UDS_6166", "GDS_3193", "COS_4318"). Each ID matches the corresponding image filename, which adds a `.jpg` suffix.
- Each galaxy has three different captions describing its appearance and features.
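As an illustration of the layout described above, the following minimal sketch loads one split file and counts galaxies and captions. The helper names `load_captions` and `summarize` are hypothetical and not part of the dataset, and the toy entries use placeholder caption strings rather than real data.

```python
import json
from pathlib import Path


def load_captions(path):
    """Load one split file and return a dict mapping galaxy ID to its captions."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(captions):
    """Return (number of unique galaxies, total number of captions)."""
    n_galaxies = len(captions)
    n_captions = sum(len(caps) for caps in captions.values())
    return n_galaxies, n_captions


# Toy data in the documented layout; the caption strings are placeholders.
example = {
    "UDS_6166": ["Caption 1", "Caption 2", "Caption 3"],
    "GDS_3193": ["Caption 1", "Caption 2", "Caption 3"],
}
print(summarize(example))  # prints (2, 6)

# With the real files (path assumed relative to the working directory):
# train = load_captions("train_captions.json")
# print(summarize(train))
```

Running `summarize` on each split is a quick way to check the counts listed for the three files.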

## Images

Because of the size limit on supplemental material, the galaxy images are not included here. They can be accessed through the `mwalmsley/gz_candels` dataset repository on Hugging Face.
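Once the images have been obtained separately, captions can be paired with them by filename. The sketch below assumes the images sit in a local directory (the name `images` is hypothetical) and that each file is named as the galaxy ID plus `.jpg`; the helper functions are illustrative and not part of the dataset.

```python
from pathlib import Path


def image_path(galaxy_id, image_dir):
    """Build the expected image path for a galaxy ID (ID plus '.jpg')."""
    return Path(image_dir) / f"{galaxy_id}.jpg"


def pair_captions_with_images(captions, image_dir):
    """Yield (image_path, caption) pairs, skipping galaxies whose image is missing."""
    for galaxy_id, caps in captions.items():
        path = image_path(galaxy_id, image_dir)
        if not path.is_file():
            continue  # image not downloaded; skip rather than fail
        for cap in caps:
            yield path, cap


print(image_path("UDS_6166", "images").name)  # prints UDS_6166.jpg
```

Skipping missing files keeps the pairing usable with a partial download; raising an error instead would be the stricter choice.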

## Citations

[1]: Simmons, B. D., et al. ‘Galaxy Zoo: Quantitative Visual Morphological Classifications for 48 000 Galaxies from CANDELS’. Monthly Notices of the Royal Astronomical Society, vol. 464, no. 4, Oct. 2016, pp. 4420–4447, https://doi.org/10.1093/mnras/stw2587.


## Licence

This dataset is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) licence. 

For more details, see the [LICENCE](LICENCE) file or visit [Creative Commons](https://creativecommons.org/licenses/by-nc-sa/4.0/).