## High-level overview

### 0. Directory structure

```
├── data
├── ds_configs
├── prompts
├── results
│   ├── auc_res
│   ├── metrics_to_auc
│   ├── per_sample_result
│   └── subsets
├── scripts
└── utils
```

1. data: contains local JSON datasets (BIG-bench tasks that had to be generated using the scripts provided in the BIG-bench repo)
2. ds_configs: contains accelerate configuration files for running distributed training
3. prompts: contains prompts for each downstream task used for fine-tuning and evaluating the model
4. results: contains fine-tuned model performance results across varying data budgets for all tasks. Its subdirectories:
    - auc_res: contains task data efficiency (AUC) results for each downstream task
    - metrics_to_auc: contains leave-one-out prediction results mapping task difficulty metrics to task data efficiency
    - per_sample_result: contains the per-sample gradients, probabilities, and cosine similarities needed to derive the task difficulty metrics
    - subsets: contains the task data subsets (low-confidence, high-perplexity, etc.) needed to derive Cos-Low (gradient cosine similarity of low-confidence examples)
5. scripts: contains the base model and child model classes for distributed fine-tuning
6. utils: contains miscellaneous Python scripts for calculating the metrics (task data efficiency, task difficulty metrics, mapping from task difficulty to data efficiency metric)

### 1. Running distributed fine-tuning

One key step in our experiments is fine-tuning the base model on varying data budgets.
The following example runs this experiment with the **finetune_ds.py** script:

```
# assumes an HPC environment with GPU access already available

model_prefix="mistral"
dataset_name="super_glue"
dataset_key="super_glue"
task_key="wic"
for i in 50 100 200 500 1000 2500; do
        accelerate launch --num_processes 2 --config_file ~/ds_configs/fsdp_full_config_h100.yaml finetune_ds.py \
            --warmup_ratio 0.1 \
            --step 500 \
            --dataset_name "$dataset_name" \
            --task "$task_key" \
            --batch_size 4 \
            --grad_accumulation_step 4 \
            --max_seq_len 2048 \
            --model_name "meta-llama/Llama-3.1-8B-Instruct" \
            --checkpoint_dir {name_of_checkpoint_dir} \
            --lr 1e-5 \
            --data_size $i \
            --log_and_save_step 5 \
            --use_flash_attention
done
```
* dataset_name: dataset name found in prompts_by_task_modified.yaml (e.g. super_glue)
* task: specific task name, a key under dataset_name (e.g. copa)
* model_name: model path on Hugging Face or in a local directory
* data_size: specifies the data budget for training. For evaluation, the fine-tuned performance for this data_size is logged
* max_seq_len: maximum sequence length to include in the dataset; all sequences longer than max_seq_len are excluded
* result_path: file path for storing the evaluation result
* use_safetensor: include this flag to use safetensors
* use_flash_attention: use flash attention v2 for efficient compute (requires Ampere or newer GPUs)
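
The interaction between `max_seq_len` and `data_size` can be sketched as follows. This is only an illustration of the behaviour described above, not the code in finetune_ds.py: the helper name `build_training_subset`, the precomputed `token_lengths` list, and the seeded random sampling are all assumptions.

```python
import random

def build_training_subset(examples, token_lengths, max_seq_len, data_size, seed=0):
    """Hypothetical helper: drop over-long examples, then sample the data budget."""
    # Keep only examples whose token count does not exceed max_seq_len.
    kept = [ex for ex, n in zip(examples, token_lengths) if n <= max_seq_len]
    # If the budget exceeds what survived the filter, use everything that is left.
    if data_size >= len(kept):
        return kept
    # Seeded sampling (an assumption) keeps each budget reproducible across runs.
    return random.Random(seed).sample(kept, data_size)

# Toy inputs, not real task data.
examples = [f"example-{i}" for i in range(10)]
token_lengths = [100, 3000, 250, 2048, 2049, 10, 500, 4096, 75, 1200]
subset = build_training_subset(examples, token_lengths, max_seq_len=2048, data_size=5)
print(len(subset))
```

Under these assumptions the length filter runs before the budget is applied, so a budget of 5 always yields 5 examples as long as at least 5 fit within `max_seq_len`.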

### 2. Running evaluation code

The **eval.py** script evaluates model performance after fine-tuning on the specified data size.
To run the script:
```
export PYTHONPATH=$(pwd)
cd scripts/
python eval.py --dataset_name "super_glue" \
               --task "wic" \
               --split "test" \
               --model_name {path_to_model_checkpoint} \
               --data_size {specified_data_size} \
               --max_seq_len 2048 \
               --result_path ../results/llama_wic_full_result_v2.json \
               --use_safetensor \
               --use_flash_attention
```

### 3. Running the per-sample metric calculation script (grad norm, model confidence, cosine similarity)

**calculate_per_sample_metrics.py** calculates the per-sample gradient norm, model confidence, or cosine similarity needed to derive the task difficulty metrics, which are later used to predict task data efficiency (AUC):

```
cd utils/
python calculate_per_sample_metrics.py \
        --model_name "meta-llama/Llama-3.1-8B-Instruct" \
        --num_sample 2500 \
        --data_split "train" \
        --file_postfix "v2_conf0" \
        --use_peft \
        --subset_path "../results/subsets/conf_decile_0_by_task_complete.json" \
        --cosine_batch_size 32
```

* model_name: the base_model name to get the per-sample metric from
* num_sample: total number of samples in the task dataset for which to calculate the per-sample metric
* data_split: the split of the dataset used for per-sample metric calculation
* file_postfix: file postfix for logging the result
* use_peft: whether to use LoRA adapters for calculating gradient-related metrics (otherwise, storing multiple per-sample gradients from the 8B-parameter base model becomes expensive)
* subset_path: the subset of task data points used for per-sample metric calculation (e.g. in the command above, per-sample metrics are computed on low-confidence task examples)
* cosine_batch_size: specifies the batch size within which the per-sample cosine similarity is calculated.
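
To make the role of `cosine_batch_size` concrete, here is a minimal pure-Python sketch of per-sample gradient norms and within-batch cosine similarity. It assumes each sample's gradient is already flattened into a list of floats and that a sample's cosine score is its mean similarity to the other samples in the same batch; the aggregation actually used in calculate_per_sample_metrics.py may differ, and the function names are hypothetical.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(x * x for x in vec))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = l2_norm(u) * l2_norm(v)
    return sum(a * b for a, b in zip(u, v)) / denom if denom > 0 else 0.0

def per_sample_metrics(grads, batch_size):
    """Return (norms, cos), where cos[i] is the mean cosine similarity between
    sample i and the other samples in its batch (an assumed aggregation)."""
    norms = [l2_norm(g) for g in grads]
    cos = []
    for start in range(0, len(grads), batch_size):
        batch = grads[start:start + batch_size]
        for i, g in enumerate(batch):
            others = [cosine(g, h) for j, h in enumerate(batch) if j != i]
            cos.append(sum(others) / len(others) if others else 0.0)
    return norms, cos

# Toy gradients, not real model outputs.
grads = [[3.0, 4.0], [6.0, 8.0], [-3.0, -4.0], [4.0, -3.0]]
norms, cos = per_sample_metrics(grads, batch_size=2)
print(norms, cos)
```

Restricting comparisons to a batch keeps the number of pairwise similarities at roughly `batch_size` per sample rather than growing with the full dataset.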


### 4. Running the task data efficiency calculation

**calculate_per_task_auc.py** calculates the task data efficiency using the ground-truth model performance results logged at varying data budgets (from steps 1 and 2).

```
cd utils/
python calculate_per_task_auc.py
```
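
As a rough illustration of what an area-under-the-curve summary of performance across data budgets looks like, the sketch below applies the trapezoidal rule over log10 of the budget and normalizes by the covered range. The log scale, the normalization, and the function name `data_efficiency_auc` are assumptions; the exact definition used in calculate_per_task_auc.py may differ.

```python
import math

def data_efficiency_auc(budgets, scores):
    """Normalized trapezoidal area under the score-vs-log10(budget) curve."""
    pairs = sorted(zip(budgets, scores))
    xs = [math.log10(b) for b, _ in pairs]
    ys = [s for _, s in pairs]
    # Sum the trapezoid areas between consecutive budgets.
    area = sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))
    # Divide by the x-range so the result stays on the same scale as the scores.
    return area / (xs[-1] - xs[0])

budgets = [50, 100, 200, 500, 1000, 2500]   # budgets from the fine-tuning loop
flat = data_efficiency_auc(budgets, [0.8] * 6)
# Made-up scores for illustration only, not experimental results.
rising = data_efficiency_auc(budgets, [0.5, 0.6, 0.7, 0.75, 0.78, 0.8])
print(flat, rising)
```

With this normalization, a task that reaches its final score at the smallest budget gets a higher value than one that reaches the same final score only at the largest budget.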

### 5. Running the task data efficiency prediction

**predict_auc_from_metric.py** predicts the task data efficiency using task difficulty metrics derived from step #3.

```
cd utils/
python predict_auc_from_metric.py
```
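
The leave-one-out setup mentioned for `metrics_to_auc` can be sketched as below: each task's AUC is predicted by a model fit on all other tasks. A one-variable least-squares line is used here purely as a stand-in, since the README does not say which predictor predict_auc_from_metric.py uses; all names and values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def leave_one_out_predictions(metric, auc):
    """Predict each task's AUC from a line fit on the remaining tasks."""
    preds = []
    for i in range(len(metric)):
        # Hold out task i and fit on the rest.
        xs = metric[:i] + metric[i + 1:]
        ys = auc[:i] + auc[i + 1:]
        slope, intercept = fit_line(xs, ys)
        preds.append(slope * metric[i] + intercept)
    return preds

# Synthetic, exactly linear toy data; not experimental results.
metric = [0.1, 0.2, 0.3, 0.4, 0.5]
auc = [0.9 - 0.5 * m for m in metric]
preds = leave_one_out_predictions(metric, auc)
print(preds)
```

Because the held-out task never influences its own fit, the gap between `preds` and `auc` gives an estimate of how well a difficulty metric generalizes to unseen tasks.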