
# Language Model-Driven Data Pruning Enables Efficient Active Learning

This repository is the official implementation of the paper 'Language Model-Driven Data Pruning Enables Efficient Active Learning'. It extends the [AL Toolbox](https://github.com/AIRI-Institute/al_toolbox) and adds support for ActivePrune. We recommend reading the [AL Toolbox](https://github.com/AIRI-Institute/al_toolbox) documentation first to understand the AL setup.

## Requirements


To install requirements:

```setup
pip install .
```

📋  All datasets are loaded through the HuggingFace `datasets` library and are downloaded automatically during training/evaluation.

## Running ActivePrune

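To use ActivePrune with any AL strategy, run the command below. This example uses the coreset strategy (set via `al.strategy`) with the agnews dataset, and automatically downloads the required dataset and models. The absolute paths in the command (`/home/jovyan/active-learning-qlora/...`) are hard-coded; replace them with the location of your own checkout.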

```train
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES='0' HYDRA_CONFIG_PATH=/home/jovyan/active-learning-qlora/acleto/al_benchmark/configs \
        HYDRA_CONFIG_NAME=al_cls_agnews.yml \
        python /home/jovyan/active-learning-qlora/scripts/run_active_learning.py \
        al.num_queries=5 \
        al.llm_subsampling_kwargs.method="HYBRID" \
        al.llm_subsampling_kwargs.hybrid_weight=0.8 \
        al.llm_subsampling_kwargs.scorer_model_name="gemma" \
        al.llm_subsampling_kwargs.offset=0 \
        al.llm_subsampling_kwargs.upper_threshold_llm_scores=5000 \
        al.strategy='coreset'
```

The code for Perplexity, ASK-LLM & ActivePrune is in the following file:

```
acleto/al4nlp/pool_subsampling_strategies/llm_subsampling.py
```

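## Running other pruning strategies

The commands below pass the shell variable `seed` as the `seed` override, so set it before running them (for example, `seed=42`).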

### Perplexity 

```
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES='0' HYDRA_CONFIG_PATH=/home/jovyan/active-learning-qlora/acleto/al_benchmark/configs \
        HYDRA_CONFIG_NAME=al_cls_agnews.yml \
        python /home/jovyan/active-learning-qlora/scripts/run_active_learning.py \
        seed=$seed \
        al.num_queries=5 \
        al.llm_subsampling_kwargs.method="BOTTOM_K_PERPLEXITY" \
        al.llm_subsampling_kwargs.scorer_model_name="gemma" \
        al.llm_subsampling_kwargs.offset=0  \
        al.strategy='coreset'
```
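
To repeat a run over several seeds, the command above can be wrapped in a small loop. The following is only a sketch, not part of the repository: `REPO_ROOT`, `SEEDS`, `DRY_RUN`, and `run_seed` are hypothetical helper names, the seed values are placeholders, and the default `REPO_ROOT` simply mirrors the path used in the commands in this README. With `DRY_RUN=1` (the default here) the script only prints each command, so it can be checked before launching real runs.

```shell
#!/bin/sh
# Sketch: run the Perplexity pruning command once per seed.
# REPO_ROOT, SEEDS, DRY_RUN and run_seed are hypothetical helpers (assumptions),
# not options provided by the repository.
REPO_ROOT="${REPO_ROOT:-/home/jovyan/active-learning-qlora}"
SEEDS="${SEEDS:-42 43 44}"   # placeholder seed values
DRY_RUN="${DRY_RUN:-1}"      # 1 = print the commands only, 0 = execute them

run_seed() {
    seed="$1"
    # Store the command as positional parameters so each argument stays intact.
    set -- env CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 \
        HYDRA_CONFIG_PATH="$REPO_ROOT/acleto/al_benchmark/configs" \
        HYDRA_CONFIG_NAME=al_cls_agnews.yml \
        python "$REPO_ROOT/scripts/run_active_learning.py" \
        seed="$seed" \
        al.num_queries=5 \
        al.llm_subsampling_kwargs.method=BOTTOM_K_PERPLEXITY \
        al.llm_subsampling_kwargs.scorer_model_name=gemma \
        al.llm_subsampling_kwargs.offset=0 \
        al.strategy=coreset
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"    # show what would be executed
    else
        "$@"         # launch the run
    fi
}

for s in $SEEDS; do
    run_seed "$s"
done
```

Set `DRY_RUN=0` to launch the runs. The same pattern should carry over to the ASK-LLM and UPS commands by swapping in their overrides.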

### ASK-LLM

```
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES='0' HYDRA_CONFIG_PATH=/home/jovyan/active-learning-qlora/acleto/al_benchmark/configs \
        HYDRA_CONFIG_NAME=al_cls_agnews.yml \
        python /home/jovyan/active-learning-qlora/scripts/run_active_learning.py \
        seed=$seed \
        al.num_queries=5 \
        al.llm_subsampling_kwargs.method="ASK_LLM" \
        al.llm_subsampling_kwargs.scorer_model_name="gemma" \
        al.llm_subsampling_kwargs.offset=0 \
        al.strategy='coreset'
```

### UPS

```
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES='0' HYDRA_CONFIG_PATH=/home/jovyan/active-learning-qlora/acleto/al_benchmark/configs \
        HYDRA_CONFIG_NAME=al_cls_agnews.yml \
        python /home/jovyan/active-learning-qlora/scripts/run_active_learning.py \
        seed=$seed \
        al.num_queries=5 \
        al.strategy='lc' \
        al.llm_subsampling_kwargs.method="UPS" \
        al.sampling_type="ups"
```


## Contributing

📋  All contributions are welcome! All content in this repository is licensed under the MIT license.
