***This is the code for an LLM filter function proof of concept (POC).***
# Installation
To install:
	
- First create your environment:	
	- Create new environment and install dependencies
		```bash
		conda env create --name <myenv> --file=environment.yml
		```
	- Activate the environment
	
		```bash
		conda activate <myenv>
		```
	- Install PyTorch
	
		```bash
		pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118 
		```

- If you wish to run the model on a local copy of Llama-2,
  download the weights [here](https://llama.meta.com/llama-downloads/).
  Make sure the weights are in a sub-directory named `llama-2-7b-chat`, located in the same directory as `Testbed.py`.
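
After installation, a quick sanity check can confirm that PyTorch is importable and that the optional local Llama-2 weights sit where the instructions above expect them. The sketch below is not part of the repository: the helper names `torch_status` and `local_llama_dir` are hypothetical, and it assumes it is run from the directory that contains `Testbed.py`.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def torch_status() -> str:
    """Report whether PyTorch is installed and whether CUDA is visible."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the check also runs without torch

    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


def local_llama_dir(project_root: Path) -> Optional[Path]:
    """Return the llama-2-7b-chat sub-directory if it exists, else None."""
    candidate = project_root / "llama-2-7b-chat"
    return candidate if candidate.is_dir() else None


if __name__ == "__main__":
    print(torch_status())
    # Assumption: the current directory is the one containing Testbed.py.
    found = local_llama_dir(Path.cwd())
    print(f"local Llama-2 weights: {found or 'not found'}")
```

If the weights directory is reported as not found, only the `censored` model option should be affected, since the other options refer to Hugging Face model ids.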
# Default Dataset - BeaverTails
-  [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails)

# Default Models
- [Llama-2 Uncensored](https://huggingface.co/georgesung/llama2_7b_chat_uncensored)
- [Llama-3](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Dolphin (Llama 3 Uncensored)](https://huggingface.co/cognitivecomputations/dolphin-2.9-llama3-8b)
- [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)

# Running the Method
- To run the method on the default dataset with the uncensored Llama-2 model and compare results:
***torchrun Testbed.py -d \<Output Directory\> -a 0.98 0 -m uncensored -sn 100***

- Notes:
	- The program will create the directory given in the argument -d if it does not exist
	- The program will create a directory for the final .csv files, for later analysis
	- The program automatically runs analysis on the output, but this step is not guaranteed to succeed every time (for example, because it depends on Google/OpenAI APIs)
- Arguments:

| flag | alias | help | default |
| ---- | ----- | ---- | ------- |
| -d | --output_dir | the directory where the evaluation files are written | - |
| -a | --alpha | the list of alphas to test | - |
| -m | --model_card | the model to run the dataset on. <br>Options: <br>'mistral' (mistralai/Mistral-7B-Instruct-v0.3)<br>'uncensored' (georgesung/llama2_7b_chat_uncensored)<br>'daredevil' (mlabonne/Daredevil-8B-abliterated)<br>'llama3' (meta-llama/Meta-Llama-3-8B-Instruct)<br>'dolphin' (cognitivecomputations/dolphin-2.9-llama3-8b)<br>'censored' (local Llama-2 model)<br>Other: any Hugging Face LM (not guaranteed to work out of the box, as every model has its own input/chat format and way of tokenizing input) | - |
| -p | --partition | the partition of the dataset to use in the current job (meant for running multiple jobs on the same dataset) | 0 |
| -pn | --partition_num | the total number of partitions | 1 |
| -sn | --sample_num | the number of samples to evaluate | 100 |
| -data | --dataset | the dataset to use (defaults to the PKU-Alignment/BeaverTails dataset) | Beavertails |
| -negative | --negative_prompt | negative prompts to use for evaluation. To use the default negative prompts, set this argument to 'default' or leave it empty | default |
| -neg_custom | --negative_custom | custom list of negative prompts to use | None |
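
To make the flag table concrete, here is a minimal `argparse` sketch that accepts the same flags and aliases. It is an illustration only, not the parser used in `Testbed.py`: the function name `build_parser`, the argument types (for example, treating alphas as floats), and the use of `nargs="+"` for list-valued flags are assumptions inferred from the table and the run commands.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the table above."""
    p = argparse.ArgumentParser(prog="Testbed.py")
    p.add_argument("-d", "--output_dir", help="directory for evaluation files")
    # Assumption: alphas are floats and one or more may be given.
    p.add_argument("-a", "--alpha", type=float, nargs="+", help="alphas to test")
    p.add_argument("-m", "--model_card", help="model shortcut or Hugging Face id")
    p.add_argument("-p", "--partition", type=int, default=0)
    p.add_argument("-pn", "--partition_num", type=int, default=1)
    p.add_argument("-sn", "--sample_num", type=int, default=100)
    p.add_argument("-data", "--dataset", default="Beavertails")
    p.add_argument("-negative", "--negative_prompt", default="default")
    # Assumption: custom negative prompts are passed as space-separated words.
    p.add_argument("-neg_custom", "--negative_custom", nargs="+", default=None)
    return p


# Parse the basic run command from the section above.
args = build_parser().parse_args(
    "-d out -a 0.98 0 -m uncensored -sn 100".split()
)
print(args.alpha, args.model_card, args.sample_num, args.partition_num)
```

Flags that are omitted fall back to the defaults listed in the table, which is why `partition_num` is still populated in the parsed result.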

Example run command:

***torchrun Testbed.py -d eval_output_samples/dolphin_2 -a 0.981 0 -m dolphin -sn 200 -p 1 -pn 2 -data truthfulqa/truthfulqa -negative custom -neg_custom cows chickens bugs*** 
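
The `-p 1 -pn 2` pair in the command above assigns this job one of two partitions, so that several jobs can share a dataset. How `Testbed.py` actually splits the data is not documented here. The sketch below assumes zero-based partition indices (suggested by the default of 0) and contiguous, near-equal slices; `partition_bounds` is a hypothetical helper, not a function from the repository.

```python
from typing import Tuple


def partition_bounds(total: int, partition: int, partition_num: int) -> Tuple[int, int]:
    """Return the [start, end) index range covered by one partition."""
    if not 0 <= partition < partition_num:
        raise ValueError("partition must satisfy 0 <= partition < partition_num")
    base, extra = divmod(total, partition_num)
    # The first `extra` partitions each take one additional sample.
    start = partition * base + min(partition, extra)
    end = start + base + (1 if partition < extra else 0)
    return start, end


# Illustrative only: split 10 samples across 3 jobs.
samples = list(range(10))
for job in range(3):
    start, end = partition_bounds(len(samples), job, 3)
    print(job, samples[start:end])
```

Under these assumptions every sample lands in exactly one partition, so the per-job output files can be concatenated afterwards without duplicates.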