# Running TinyLLaVA-related experiments
## Fine-tuning TinyLLaVA StableLM base (same model as COINCIDE)
Run the following script:
```Shell
bash scripts/train/train_stable_lm.sh
```
Make sure to set the desired number of checkpoints (all reported results use 23 checkpoints). This can be adjusted by modifying `save_steps` in `scripts/trainfinetune.sh`. This file has its arguments set so that training runs for 3 epochs with a `per_device_train_batch_size` of 16 and a `per_device_eval_batch_size` of 4.
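
As a rough aid for choosing `save_steps`, the sketch below estimates a value that yields a target number of checkpoints. It assumes a checkpoint is written every `save_steps` optimizer steps and that the last partial batch of each epoch is kept. The sample count, the number of devices, and the function name `save_steps_for` are illustrative assumptions, not values taken from the repository.

```python
import math


def save_steps_for(num_samples, per_device_batch_size, num_devices,
                   num_epochs, num_checkpoints, grad_accum_steps=1):
    """Estimate save_steps for a target checkpoint count (hypothetical helper)."""
    # Samples consumed per optimizer step across all devices.
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    # Assumes the last partial batch of each epoch still counts as a step.
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    save_steps = max(1, total_steps // num_checkpoints)
    # Number of checkpoints actually produced with this save_steps.
    return save_steps, total_steps // save_steps


# Assumed setup: 665,000 samples, 8 GPUs, batch size 16 per device, 3 epochs.
save_steps, n_checkpoints = save_steps_for(665_000, 16, 8, 3, 23)
print(save_steps, n_checkpoints)  # 677 23
```

For short runs the integer division may not land exactly on the target count, so check the second return value before launching training.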
## Generating Trajectories
Once fine-tuning has saved all the checkpoints, compute the per-sample loss and attn_svd values at each checkpoint to obtain the trajectories. The following script processes all the checkpoints in parallel on a single GPU.
### Loss & Attention SVD Trajectories
Run the following script:
```Shell
python run_distributed_trajectories.py --model_path $PATH_TO_ALL_CHECKPOINTS 
```
> You can also pass `--checkpoints` to select specific checkpoints; see `run_distributed_trajectories.py` for more details.
>
> The `method` argument controls how the SVD is computed. It can be set to "svd-full" (suggested) or "partial-svd" (kept for legacy reasons; use it only if just the largest singular value is needed). `k` is another argument that can be set in `scripts/train/trajectories.sh`. Avoid k>1 with the partial SVD method, since the attention matrices become ill-defined for that computation.
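
To make the difference between the two modes concrete, here is a minimal NumPy sketch. It is not the code in `run_distributed_trajectories.py`: the function names are hypothetical, and power iteration is used only as a stand-in for a partial SVD that returns the largest singular value.

```python
import numpy as np


def top_k_singular_values(attn, k):
    """Full SVD: returns the k largest singular values of one attention matrix."""
    # np.linalg.svd returns singular values sorted in descending order.
    singular_values = np.linalg.svd(attn, compute_uv=False)
    return singular_values[:k]


def largest_singular_value(attn, n_iters=200, seed=0):
    """Power iteration on attn^T attn: estimates only the largest singular value."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(attn.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = attn.T @ (attn @ v)
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(attn @ v))


# A toy row-stochastic matrix standing in for one attention head.
attn = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
print(top_k_singular_values(attn, k=2))
print(largest_singular_value(attn))
```

The full SVD gives all singular values in one call, which is why it suits k>1; the iterative estimate only targets the top one.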
## Clustering Trajectories
Run the following script:
```Shell
python kmeans_clustering.py --metric $METRIC_NAME --metric_dir $OUTPUT_DIR_TO_FINE_TUNED_MODEL/attn_svd_files/ --k $TOP_K_ATTN_SVD_VALUES  # for attn_svd clustering
python kmeans_clustering.py --metric $METRIC_NAME --metric_dir $OUTPUT_DIR_TO_FINE_TUNED_MODEL/loss_files/             # for loss clustering
```
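For intuition about what clustering trajectories involves, the sketch below runs a plain k-means (Lloyd's algorithm) on a small array where each row is one sample's metric value across checkpoints. It is an illustrative stand-in under that assumed data layout, not the implementation in `kmeans_clustering.py`, and the function name `kmeans` is hypothetical.

```python
import numpy as np


def kmeans(trajectories, n_clusters, n_iters=50, seed=0):
    """Minimal Lloyd's algorithm; rows are per-sample trajectories."""
    rng = np.random.default_rng(seed)
    X = np.asarray(trajectories, dtype=float)
    # Initialise centers from distinct randomly chosen samples.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # Distance of every trajectory to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two samples whose loss drops quickly, two whose loss stays flat.
loss_trajectories = [[2.0, 1.0, 0.5],
                     [2.1, 1.1, 0.4],
                     [2.0, 2.0, 2.0],
                     [2.1, 1.9, 2.0]]
labels, _ = kmeans(loss_trajectories, n_clusters=2)
print(labels)
```

Samples with similar learning dynamics end up in the same cluster, which is what the data selection step below operates on.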
## Data Selection
Arguments:
1. `method`: selects the data selection method. The supported methods are:
    1. method 1: selects an equal number of samples from each cluster where possible; otherwise takes the complete cluster.
    2. method 2: selects clusters according to their size (largest first).
    3. method 3: uses a metric (ppl/learnability/MIR) as the basis for comparison, then computes two probability distributions, one over all the clusters and one within each cluster, and selects samples according to these probabilities.
    4. method 4 (double clustering): performs an inner clustering within each cluster, using either KMeans or HDBScan.
2. The remaining arguments are self-explanatory.
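
As an illustration of method 1, the sketch below splits a selection budget equally across clusters and takes a cluster whole when it is smaller than its share. The function name `equal_per_cluster` is hypothetical, and the choice not to redistribute unused budget is an assumption of this sketch; `cluster_data_selection.py` may handle the leftover differently.

```python
import random
from collections import defaultdict


def equal_per_cluster(cluster_ids, budget, seed=0):
    """Pick an equal share from each cluster, or the whole cluster if it is small."""
    rng = random.Random(seed)
    # Group sample indices by their cluster assignment.
    by_cluster = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(idx)
    quota = budget // len(by_cluster)
    selected = []
    for members in by_cluster.values():
        if len(members) <= quota:
            selected.extend(members)  # take the complete cluster
        else:
            selected.extend(rng.sample(members, quota))
    return sorted(selected)


# 12 samples: a large cluster (10 samples) and a small one (2 samples).
cluster_ids = [0] * 10 + [1] * 2
percentage = 50
budget = len(cluster_ids) * percentage // 100
print(equal_per_cluster(cluster_ids, budget))
```

Here the budget is 6, so each cluster's share is 3; the small cluster contributes both of its samples and the large one contributes 3.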
```Shell
python cluster_data_selection.py --percentage $PERCENTAGE_OF_DATA_TO_BE_SELECTED --method $CLUSTERING_METHOD --clusters_file_path $CLUSTER_ASSIGNMENT_PATH --metric_file_path $METRIC_FILE_PATH --msa_embeds_dir $LAST_LAYER_DIRECTORY --data_sel_metric_file_path $ATTN_SVD_STABILITY_FILE_PATH --data_file_path $PATH_TO_665K_JSON --save_path $SAVE_PATH
```
> If method == 3, `--metric_file_path` (path to the metric (ppl/learnability/MIR) file) is a required argument.
>
> If method == 4, `--data_sel_metric_file_path` (attn_svd trajectory stability values file), `--msa_embeds_dir` (directory containing the last-layer embeddings), and `--inner_clustering` (the inner clustering method, "hdbscan" or "kmeans") are required arguments.
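
The two-level sampling described for method 3 can be sketched as follows. How the script actually derives the two distributions is not stated here, so this example assumes the cluster-level probability is proportional to the cluster's mean metric value and the within-cluster probability is proportional to each sample's metric value (with all values positive). The function name `two_level_sample` is hypothetical.

```python
import numpy as np


def two_level_sample(cluster_ids, scores, budget, seed=0):
    """Sample by cluster-level and within-cluster probabilities (illustrative)."""
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    scores = np.asarray(scores, dtype=float)  # assumed strictly positive
    clusters = np.unique(cluster_ids)
    # Assumption: cluster probability is proportional to its mean score.
    cluster_means = np.array([scores[cluster_ids == c].mean() for c in clusters])
    p_cluster = cluster_means / cluster_means.sum()
    selected = []
    for c, p in zip(clusters, p_cluster):
        members = np.flatnonzero(cluster_ids == c)
        n_take = min(len(members), int(round(p * budget)))
        if n_take == 0:
            continue
        # Assumption: sample probability is proportional to its own score.
        p_within = scores[members] / scores[members].sum()
        chosen = rng.choice(members, size=n_take, replace=False, p=p_within)
        selected.extend(int(i) for i in chosen)
    return sorted(selected)


cluster_ids = [0, 0, 0, 1, 1, 1]
scores = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
print(two_level_sample(cluster_ids, scores, budget=4))
```

With these toy values the higher-scoring cluster receives three of the four slots and the other cluster receives one.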
