# Folder for supporting analysis scripts

- [Strong lens retrieval](#strong-lens-retrieval)
  - [Data preparation](#data-preparation)
  - [Embedding data](#embedding-data)
- [Galaxy Physical Properties Analysis](#galaxy-physical-properties-analysis)
  - [Data preparation](#data-preparation-1)
  - [Running Model Inference](#running-model-inference)
- [GalaxyZoo DECaLS Retrieval](#galaxyzoo-decals-retrieval)
  - [Data preparation](#data-preparation-2)
  - [Embedding data](#embedding-data-1)
- [Galaxy Zoo 10 (GZ10) Analysis](#galaxy-zoo-10-gz10-analysis)
- [Gaia Cross-matching](#gaia-cross-matching)
- [Stellar properties](#stellar-properties)
- [Supporting Files](#supporting-files)

## Strong lens retrieval

### Data preparation

Run this script first to build a table of tokenized data matching the
selection cuts applied in the HSC paper to define the parent sample:

```bash
python scripts/preprocess_lens.py
```

This script generates a file named `lens_parent_sample_v1.fits` in the `data` folder. The file contains tokenized data from the HSC and Legacy Survey datasets, crossmatched and filtered according to the specified criteria.

### Embedding data

To embed the data using a pre-trained model, run the following script:

```bash
python scripts/data_lens_embed.py
```

This script reads the `lens_parent_sample_v1.fits` file, embeds the data using the specified model, and saves the results to a new file named `lens_parent_sample_v1_embedded_oct24_base.hdf5` in the `data` folder. The embedded data includes two sets of embeddings: `embeddings_ls` for the Legacy Survey data and `embeddings_hsc` for the HSC data.

**Arguments**:

- `--dset_path`: Path to the dataset file to be embedded (default: `data/lens_parent_sample_v1.fits`).
- `--model_path`: Path to the pre-trained model (default: `/path/to/pretrained/model`).
- `--save_path`: Path to save the embedded dataset (default: `data/lens_parent_sample_v1_embedded_oct24_base.hdf5`).
- `--batch_size`: Batch size for processing the data (default: 256).

**Example**:

```bash
python scripts/data_lens_embed.py --dset_path data/lens_parent_sample_v1.fits --model_path /path/to/pretrained/model --save_path data/lens_parent_sample_v1_embedded_oct24_base.hdf5 --batch_size 256
```
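As a rough illustration of what `--batch_size` controls, the sketch below splits a catalog of rows into consecutive index ranges of at most `batch_size` rows. The helper name `iter_batches` and the row count are hypothetical and are not taken from `data_lens_embed.py`; the actual script may batch its data differently.

```python
def iter_batches(n_rows, batch_size):
    """Yield (start, stop) index pairs covering range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        # The last batch is shorter when n_rows is not a multiple of batch_size.
        yield start, min(start + batch_size, n_rows)


# Hypothetical catalog of 600 rows with the default batch size of 256.
batches = list(iter_batches(n_rows=600, batch_size=256))
print(batches)  # [(0, 256), (256, 512), (512, 600)]
```

A larger batch size means fewer model calls but more memory per call, so the default may need lowering on smaller GPUs.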

## Galaxy Physical Properties Analysis

### Data preparation

The `preprocess_provabgs.py` script downloads and preprocesses the PROVABGS dataset. Specifically, it uses the PROVABGS Bayesian SED models to calculate best-fit parameters for stellar mass, metallicity, age, and star formation rate. Metallicity is converted to log metallicity and star formation rate is converted to specific star formation rate (sSFR). The resulting data is stored in the `data` folder.

```bash
python scripts/preprocess_provabgs.py
```
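The two conversions mentioned above can be sketched as follows. This is a minimal illustration using the standard definitions (sSFR is the star formation rate divided by stellar mass, and log metallicity is a base-10 logarithm); the function names, units, and the choice of base 10 are assumptions and are not taken from `preprocess_provabgs.py`.

```python
import math


def log_metallicity(metallicity):
    """Base-10 logarithm of a (positive) metallicity value."""
    return math.log10(metallicity)


def specific_sfr(sfr, stellar_mass):
    """sSFR = SFR / M*, assuming SFR in Msun/yr and M* in Msun (result in 1/yr)."""
    return sfr / stellar_mass


# Hypothetical galaxy: Z = 0.01, SFR = 2 Msun/yr, M* = 1e10 Msun.
example_log_z = log_metallicity(0.01)
example_ssfr = specific_sfr(2.0, 1e10)
print(example_log_z, example_ssfr)
```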

The `data_provabgs_xmatch.py` script generates crossmatches between the DESI spectra, the PROVABGS dataset, and other surveys (Legacy Survey and HSC). It tokenizes the data on the fly and exports the results as Astropy tables in FITS format to the `data` folder.

```bash
python scripts/data_provabgs_xmatch.py
```

This will generate the following files:

```
data/
├── provabgs_legacysurvey_train_vX.fits
├── provabgs_legacysurvey_eval_vX.fits
├── provabgs_hscwide_train_vX.fits
└── provabgs_hscwide_eval_vX.fits
```

where X is a version number.

### Running Model Inference

The `run_provabgs_eval.py` script is designed to run inference using pre-trained models on a catalog of data and save the results for later analysis and plotting. This script processes the data in batches, applies the model, and saves the predictions in a specified output directory.

#### Usage

To run the script, use the following command:

```bash
python scripts/run_provabgs_eval.py --wandb_csv_file <path_to_csv_file> [--overwrite]
```

**Arguments**:

- `--wandb_csv_file`: Path to the CSV file containing the run IDs of the models to be evaluated. This file should have a column named `ID` with the run IDs.
- `--overwrite`: If specified, existing output files will be overwritten.

**Example**:

```bash
python scripts/run_provabgs_eval.py --wandb_csv_file scripts/csv_runs/provabgs_runs_v2.csv --overwrite
```
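Before launching a long evaluation, it can help to check that the CSV file actually exposes the `ID` column described above. The sketch below does this with the standard library only; the helper `read_run_ids`, the extra `Name` column, and the run IDs are hypothetical and are not part of `run_provabgs_eval.py`.

```python
import csv
import io


def read_run_ids(csv_file):
    """Return the non-empty values of the `ID` column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "ID" not in reader.fieldnames:
        raise ValueError("CSV file must contain a column named 'ID'")
    # Skip rows where the ID cell is missing or blank.
    return [row["ID"].strip() for row in reader if row["ID"] and row["ID"].strip()]


# Hypothetical file content; real run IDs come from your own training runs.
example_csv = io.StringIO("Name,ID\nmodel_small,abc123\nmodel_large,def456\n")
run_ids = read_run_ids(example_csv)
print(run_ids)  # ['abc123', 'def456']
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.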

## GalaxyZoo DECaLS Retrieval

### Data preparation

The `data_gzdecals_xmatch.py` script generates crossmatches between the Legacy Survey images and photometry and the GalaxyZoo DECaLS volunteer labels. It tokenizes the data on the fly and exports the results as a fits AstroPy table to the `data` folder.

```bash
python scripts/data_gzdecals_xmatch.py
```

This will generate the following files:

```
data/
└── gz5_legacysurvey_matches.hdf5
```

### Embedding data

To embed the GalaxyZoo DECaLS data using a pre-trained model, run:

```bash
python scripts/embed_gzdecals.py
```

This script reads the GalaxyZoo DECaLS crossmatch data, embeds it using the specified model, and saves the embeddings to a new file.

**Arguments**:

- `--dset_path`: Path to the dataset file to be embedded (default: `./data/gz5_legacysurvey_matches_mp.hdf5`)
- `--model_path`: Path to the pre-trained model (default: `data/aion/dec24/large`)
- `--save_path`: Path to save the embedded dataset (default: `./data/gz5_large_embedded.hdf5`)
- `--batch_size`: Batch size for processing the data (default: 256)

**Example**:

```bash
python scripts/embed_gzdecals.py --dset_path data/gz5_legacysurvey_matches.hdf5 --model_path /path/to/model --save_path data/gz5_embedded.hdf5
```

## Galaxy Zoo 10 (GZ10) Analysis

The `data_gz10_xmatch.py` script generates crossmatches between the Galaxy Zoo 10 dataset and other surveys (Legacy Survey and HSC). It tokenizes the data on the fly and exports the results as FITS tables.

```bash
python scripts/data_gz10_xmatch.py
```

This script:
- Crossmatches GZ10 data with Legacy Survey and/or HSC data
- Tokenizes the matched data using appropriate tokenizers
- Supports both tokenized and raw image outputs (controlled by the `VERSION` parameter in the script)
- Exports results as FITS files

Output files:
```
data/
├── gz10_legacysurvey_vX.fits
└── gz10_hsc_vX.fits
```

where X is the version number specified in the script.

## Gaia Cross-matching

The `3wayxmatch.py` script performs three-way crossmatching between Gaia parallax sample data and spectroscopic surveys (SDSS and DESI). This is useful for studies requiring both astrometric and spectroscopic information.

```bash
python scripts/3wayxmatch.py
```

This script:
- Crossmatches Gaia parallax sample with SDSS and DESI spectra
- Extracts object IDs and spectra from the matched sources
- Processes data for specific healpix regions
- Saves results as pickle files

Output files:
```
data/
├── gaia_x_sdss_v1.pkl
└── gaia_x_desi_v1.pkl
```

## Stellar properties

Download the catalog of stellar property estimates from `https://doi.org/10.12149/101456`.

Run `./data_desiddpayne_xmatch.py`, updating data paths as necessary.

Train models in `aion_icml/configs/benchmarks/desiddpayne` and `aion_icml/configs/benchmarks/desiddpayne_scaling`.

Update `./csv_runs/desiddpayne_runs_v1.csv` and `./csv_runs/desiddpayne_scaling_runs_v1.csv` with the run IDs of the trained models.

Run `./run_desiddpayne_eval.py`, passing the path to the appropriate CSV file from the previous step as an argument. Note the output path where results will be written.

Run `./make_desiddpayne_token_baseline.py` to generate token-input baselines based on XGBoost models. You will need to pass in a `|`-separated CSV file (see `./csv_runs/desiddpayne_baselines_v1.csv` for an example) with the columns `name`, `input_fields`, and `num_examples`. The `input_fields` column is a `,`-separated list of the inputs you want to use for prediction. Set `num_examples` to `-1` to use all examples.
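To make the nested separators concrete, here is a minimal sketch of parsing such a file: `|` separates the columns and `,` separates the entries inside `input_fields`. It assumes the file starts with a header row naming the three columns; the helper `parse_baseline_rows` and the field names in the example are hypothetical, so check `./csv_runs/desiddpayne_baselines_v1.csv` for the real layout.

```python
import csv
import io


def parse_baseline_rows(handle):
    """Parse a `|`-separated baseline file into a list of config dicts."""
    reader = csv.DictReader(handle, delimiter="|")
    configs = []
    for row in reader:
        n = int(row["num_examples"])
        configs.append({
            "name": row["name"],
            # input_fields is itself a comma-separated list.
            "input_fields": [f.strip() for f in row["input_fields"].split(",") if f.strip()],
            # -1 means "use all examples"; represented here as None.
            "num_examples": None if n == -1 else n,
        })
    return configs


# Hypothetical content; the field names are placeholders, not real token names.
example = io.StringIO(
    "name|input_fields|num_examples\n"
    "spectrum_only|field_a|-1\n"
    "photometry|field_b,field_c,field_d|10000\n"
)
configs = parse_baseline_rows(example)
for config in configs:
    print(config)
```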

At this point you will have a set of `.fits` files containing the evaluation results for the various models.

You can then run `./analyze_desiddpayne_runs.py` to generate plots comparing the performance of the models and to compile a `.json` file reporting performance in terms of $R^2$ and the standard deviation of the residuals.
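For reference, the two metrics can be computed as in the sketch below. This uses the textbook definitions ($R^2 = 1 - SS_\mathrm{res}/SS_\mathrm{tot}$ and the population standard deviation of prediction minus truth); whether `analyze_desiddpayne_runs.py` uses exactly these conventions is an assumption, and the function names and numbers are illustrative only.

```python
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def residual_std(y_true, y_pred):
    """Population standard deviation of the residuals (prediction - truth)."""
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    return statistics.pstdev(residuals)


# Made-up values for illustration, not results from any model.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r2_score(y_true, y_pred)
sigma = residual_std(y_true, y_pred)
print(round(r2, 4), round(sigma, 4))  # 0.98 0.1581
```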

## Supporting Files

### Utility Scripts

- **`utils.py`**: Contains utility functions for getting tokenizers based on dataset names
- **`tokenizers.yaml`**: Configuration file defining tokenizer settings for different data types and surveys

### SQL Queries

- **`hsc_pdr3_quey.sql`**: SQL query for extracting data from HSC PDR3 catalog, used in the lens retrieval pipeline

### Additional Processing Scripts

#### PROVABGS Image Addition

The `add_images_to_provabgs_xmatch.py` script adds raw image data to existing PROVABGS crossmatch files:

```bash
python scripts/add_images_to_provabgs_xmatch.py
```

This script:
- Loads existing PROVABGS-Legacy Survey crossmatch files
- Extracts and adds raw image data for each matched source
- Creates new files with `_w_image` suffix containing both the original data and images
- Processes both training and evaluation splits separately

This is useful when you need the actual image arrays in addition to the tokenized representations for downstream analysis.
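The output naming can be sketched as below, assuming the `_w_image` suffix is inserted just before the file extension. That placement, the helper `with_image_suffix`, and the example path are assumptions for illustration; check the script for the exact output names.

```python
from pathlib import Path


def with_image_suffix(path):
    """Return the path with `_w_image` inserted before the file extension."""
    path = Path(path)
    return path.with_name(f"{path.stem}_w_image{path.suffix}")


# Example using the placeholder version `vX` from the file listing above.
output_path = with_image_suffix("data/provabgs_legacysurvey_train_vX.fits")
print(output_path.name)  # provabgs_legacysurvey_train_vX_w_image.fits
```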

