This code reproduces the experimental results presented in the submission titled "Semantic Density: Uncertainty Quantification in Semantic Space for Large Language Models".

Below is a step-by-step guide to running the code:

The {model_name} and {dataset_name} placeholders in the commands below can take any value from the following two lists.
model_name_list = ['Llama-2-13b-hf', 'Llama-2-70b-hf', 'Meta-Llama-3-8B', 'Meta-Llama-3-70B', 'Mistral-7B-v0.1', 'Mixtral-8x7B-v0.1', 'Mixtral-8x22B-v0.1']
dataset_name_list = ['coqa', 'trivia_qa', 'sciq', 'NQ']

1. Set up the environment:
   (a) Use the `environment_llama2_mistral_mixtral.yml` file to create an Anaconda environment for all the experiments with Llama-2-13B, Llama-2-70B, Mistral-7B, Mixtral-8x7B and Mixtral-8x22B.
   (b) Replace `anaconda3/envs/{your_env_name_for_llama2_mistral_mixtral}/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py` with the attached `modeling_llama2.py` (rename it to `modeling_llama.py`).
   (c) Replace `anaconda3/envs/{your_env_name_for_llama2_mistral_mixtral}/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py` with the attached `modeling_mistral.py`.
   (d) Replace `anaconda3/envs/{your_env_name_for_llama2_mistral_mixtral}/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py` with the attached `modeling_mixtral.py`.
   (e) Use the `environment_llama3.yml` file to create a separate Anaconda environment for all the experiments with Llama-3-8B and Llama-3-70B.
   (f) Replace `anaconda3/envs/{your_env_name_for_llama3}/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py` with the attached `modeling_llama3.py` (rename it to `modeling_llama.py`).
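
The file replacements in step 1 can also be scripted. The following is a minimal sketch, not part of the released code: the function name `patch_transformers`, the two mapping names and the `.orig` backup are assumptions made for illustration. Only the file names and destination paths come from step 1.

```python
import shutil
from pathlib import Path

# Attached file -> destination inside the installed transformers package.
# File names and destinations are taken from step 1; the dict names are hypothetical.
LLAMA2_MISTRAL_MIXTRAL_PATCHES = {
    "modeling_llama2.py": "models/llama/modeling_llama.py",
    "modeling_mistral.py": "models/mistral/modeling_mistral.py",
    "modeling_mixtral.py": "models/mixtral/modeling_mixtral.py",
}
LLAMA3_PATCHES = {
    "modeling_llama3.py": "models/llama/modeling_llama.py",
}


def patch_transformers(repo_dir, transformers_dir, patches, backup=True):
    """Copy the attached modeling files over the installed ones.

    repo_dir:         folder holding the attached modeling_*.py files
    transformers_dir: .../site-packages/transformers of the target environment
    patches:          one of the mappings defined above
    """
    repo_dir, transformers_dir = Path(repo_dir), Path(transformers_dir)
    patched = []
    for src_name, rel_dst in patches.items():
        src = repo_dir / src_name
        dst = transformers_dir / rel_dst
        backup_path = dst.with_name(dst.name + ".orig")
        # Assumption: keeping a one-time backup is useful; step 1 does not ask for it.
        if backup and dst.exists() and not backup_path.exists():
            shutil.copy2(dst, backup_path)
        shutil.copy2(src, dst)  # the copy also performs the rename to modeling_llama.py
        patched.append(dst)
    return patched
```

Inside the activated environment, the value for `transformers_dir` can be printed with `python -c "import transformers, os; print(os.path.dirname(transformers.__file__))"`. Use `LLAMA2_MISTRAL_MIXTRAL_PATCHES` for the first environment and `LLAMA3_PATCHES` for the second.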

2. Data preparation:
   (a) Set the paths for the Hugging Face model and dataset cache in `config.py`.
   (b) Download the CoQA dataset from https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json, and place it in the {data_dir} specified in `config.py`.
   (c) Run command `python parse_coqa.py` to parse the CoQA dataset.
   (d) Run command `python parse_triviaqa.py --model={model_name}` to download and parse the TriviaQA dataset.
   (e) Download the SciQ dataset from https://github.com/launchnlp/LitCab/blob/main/sciq/test.txt and put `test.txt` in `{config.data_dir}/sciq`.
   (f) Download the NQ dataset from https://github.com/launchnlp/LitCab/blob/main/NQ/test.txt and put `test.txt` in `{config.data_dir}/NQ`.
   (g) Run command `python parse_datasets.py --model={model_name} --dataset={dataset_name}` to parse the SciQ and NQ datasets.
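
The SciQ and NQ links in step 2 are GitHub page URLs (`/blob/`), so fetching them directly from a script returns an HTML page rather than `test.txt`. The sketch below is an illustration, not part of the released code: it rewrites such a URL into its raw-file form and saves the file into the folder layout described in step 2. The helper names `github_raw_url` and `download_test_file` are hypothetical, and whether the files are still hosted at these locations has not been verified here.

```python
from pathlib import Path
from urllib.request import urlretrieve


def github_raw_url(blob_url):
    """Turn a github.com '/blob/' page URL into the matching raw-file URL."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    # "owner/repo" comes before "/blob/", "branch/path/to/file" comes after it
    owner_repo, branch_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch_and_path}"


def download_test_file(blob_url, data_dir, dataset_name):
    """Save the file as {data_dir}/{dataset_name}/test.txt and return that path."""
    target_dir = Path(data_dir) / dataset_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "test.txt"
    urlretrieve(github_raw_url(blob_url), target)  # needs network access
    return target
```

For example, `download_test_file("https://github.com/launchnlp/LitCab/blob/main/sciq/test.txt", data_dir, "sciq")` would place the file where step 2(e) expects it, with `data_dir` set to the same value as in `config.py`.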

3. Generate responses:
   (a) Run command `python generate_beam_search_save_all_triviaqa_coqa_cleaned_device.py --num_generations_per_prompt='10' --model={model_name} --fraction_of_data_to_use='0.2' --num_beams='10' --top_p='1.0' --dataset='coqa' --cuda_device={cuda_device_id}` to generate responses for the CoQA dataset.
   (b) Run command `python generate_beam_search_save_all_triviaqa_coqa_cleaned_device.py --num_generations_per_prompt='10' --model={model_name} --fraction_of_data_to_use='0.1' --num_beams='10' --top_p='1.0' --dataset='trivia_qa' --cuda_device={cuda_device_id}` to generate responses for the TriviaQA dataset.
   (c) Run command `python generate_beam_search_save_all_datasets_cleaned_device.py --num_generations_per_prompt='10' --model={model_name} --fraction_of_data_to_use='1.0' --num_beams='10' --top_p='1.0' --dataset='sciq' --cuda_device={cuda_device_id}` to generate responses for the SciQ dataset.
   (d) Run command `python generate_beam_search_save_all_datasets_cleaned_device.py --num_generations_per_prompt='10' --model={model_name} --fraction_of_data_to_use='0.5' --num_beams='10' --top_p='1.0' --dataset='NQ' --cuda_device={cuda_device_id}` to generate responses for the NQ dataset.
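
The four generation commands in step 3 differ only in the script, the dataset and the fraction of data used. A small sketch of how they could be assembled programmatically follows. The table `GENERATION_SETTINGS` and the function `generation_command` are hypothetical names introduced here; the scripts, flags and values are copied from step 3.

```python
# dataset -> (generation script, fraction_of_data_to_use), copied from step 3
GENERATION_SETTINGS = {
    "coqa": ("generate_beam_search_save_all_triviaqa_coqa_cleaned_device.py", "0.2"),
    "trivia_qa": ("generate_beam_search_save_all_triviaqa_coqa_cleaned_device.py", "0.1"),
    "sciq": ("generate_beam_search_save_all_datasets_cleaned_device.py", "1.0"),
    "NQ": ("generate_beam_search_save_all_datasets_cleaned_device.py", "0.5"),
}


def generation_command(model_name, dataset_name, cuda_device_id):
    """Build the step 3 command for one model and dataset as an argument list."""
    script, fraction = GENERATION_SETTINGS[dataset_name]
    return [
        "python", script,
        "--num_generations_per_prompt=10",
        f"--model={model_name}",
        f"--fraction_of_data_to_use={fraction}",
        "--num_beams=10",
        "--top_p=1.0",
        f"--dataset={dataset_name}",
        f"--cuda_device={cuda_device_id}",
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository folder. Passing arguments as a list avoids shell quoting issues, such as a missing space between two flags.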

4. Calculate pair-wise semantic similarities for semantic entropy: run command `python get_semantic_similarities_beam_search_datasets.py --generation_model={model_name} --dataset={dataset_name}`  

5. Calculate likelihood information:
   (a) Run command `python get_likelihoods_beam_search_datasets_temperature.py --evaluation_model={model_name} --generation_model={model_name} --dataset={dataset_name} --cuda_device={cuda_device_id} --temperature=0.1`.
   (b) Run command `python get_likelihoods_beam_search_datasets_temperature.py --evaluation_model={model_name} --generation_model={model_name} --dataset={dataset_name} --cuda_device={cuda_device_id} --temperature=1.0`.

6. Calculate ROUGE scores: run command `python calculate_beam_search_rouge_datasets.py --model={model_name} --dataset={dataset_name}`

7. Calculate P(True): run command `python get_prompting_based_uncertainty_beam_search.py --generation_model={model_name} --dataset={dataset_name} --cuda_device={cuda_device_id}` 

8. Calculate semantic density:
   (a) Run command `python get_semantic_density_full_beam_search_unique_datasets_temperature.py --generation_model={model_name} --dataset={dataset_name} --cuda_device={cuda_device_id} --temperature=0.1`.
   (b) Run command `python get_semantic_density_full_beam_search_unique_datasets_temperature.py --generation_model={model_name} --dataset={dataset_name} --cuda_device={cuda_device_id} --temperature=1.0`.

9. Calculate semantic density with different numbers of reference responses:
   (a) Run command `python get_semantic_density_full_beam_search_unique_datasets_temperature_sample_num.py --generation_model={model_name} --dataset={dataset_name} --cuda_device={cuda_device_id} --temperature=0.1`.
   (b) Run command `python get_semantic_density_full_beam_search_unique_datasets_temperature_sample_num.py --generation_model={model_name} --dataset={dataset_name} --cuda_device={cuda_device_id} --temperature=1.0`.
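
Steps 4 to 9 repeat the same scripts for two temperatures, 0.1 and 1.0. The following sketch lists the commands for one model and dataset in the order in which they appear above. It is an illustration only: `uncertainty_commands` and `TEMPERATURES` are hypothetical names, and the assumption that the steps should run in this order simply follows the numbering of this guide.

```python
TEMPERATURES = ("0.1", "1.0")  # the two values used in steps 5, 8 and 9


def uncertainty_commands(model_name, dataset_name, cuda_device_id):
    """Return the commands of steps 4 to 9, in guide order, as argument lists."""
    gen = f"--generation_model={model_name}"
    data = f"--dataset={dataset_name}"
    dev = f"--cuda_device={cuda_device_id}"

    # Step 4: pairwise semantic similarities
    cmds = [["python", "get_semantic_similarities_beam_search_datasets.py", gen, data]]
    # Step 5: likelihoods, once per temperature
    for t in TEMPERATURES:
        cmds.append(["python", "get_likelihoods_beam_search_datasets_temperature.py",
                     f"--evaluation_model={model_name}", gen, data, dev,
                     f"--temperature={t}"])
    # Step 6: ROUGE scores
    cmds.append(["python", "calculate_beam_search_rouge_datasets.py",
                 f"--model={model_name}", data])
    # Step 7: P(True)
    cmds.append(["python", "get_prompting_based_uncertainty_beam_search.py",
                 gen, data, dev])
    # Steps 8 and 9: semantic density, once per temperature for each script
    for script in (
        "get_semantic_density_full_beam_search_unique_datasets_temperature.py",
        "get_semantic_density_full_beam_search_unique_datasets_temperature_sample_num.py",
    ):
        for t in TEMPERATURES:
            cmds.append(["python", script, gen, data, dev, f"--temperature={t}"])
    return cmds
```

Running the list with `for cmd in uncertainty_commands(...): subprocess.run(cmd, check=True)` stops at the first failing step, which keeps later steps from reading incomplete outputs.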

10. Calculate AUROC scores for all the uncertainty metrics:
    (a) Run command `python compute_confidence_measure_beam_search_unique_temperature.py --generation_model={model_name} --evaluation_model={model_name} --dataset={dataset_name} --temperature=0.1 --cuda_device={cuda_device_id}` and command `python compute_confidence_measure_beam_search_unique_temperature.py --generation_model={model_name} --evaluation_model={model_name} --dataset={dataset_name} --temperature=1.0 --cuda_device={cuda_device_id}`.
    (b) Create a folder named `results` to store the AUROC results.
    (c) Run command `python analyze_results_semantic_density_full_datasets_temperature.py --dataset={dataset_name} --model={model_name} --temperature=0.1 --cuda_device={cuda_device_id}` and command `python analyze_results_semantic_density_full_datasets_temperature.py --dataset={dataset_name} --model={model_name} --temperature=1.0 --cuda_device={cuda_device_id}`.
    (d) Run command `python analyze_results_semantic_density_full_datasets_temperature_sample_num.py --dataset={dataset_name} --model={model_name} --temperature=0.1 --cuda_device={cuda_device_id} --sample_num=10`.

11. Generate results shown in the paper:
    (a) Create a folder named `paper_results` to store table results and a folder named `plots` to save the figures.
    (b) Run command `python results_table_auroc.py --dataset={dataset_name} --temperature=0.1` to generate the results in Table 1.
    (c) Run command `python results_table_auroc_statistical_test.py --temperature=0.1` to generate the results in Table 2.
    (d) Run command `python results_sample_num_auroc.py --dataset={dataset_name} --temperature=0.1 --sample_num=10` and command `python plot_sample_num_auroc.py --dataset={dataset_name} --temperature=0.1 --sample_num=10` to generate the plots in Figure 1.
    (e) Run command `python results_group_auroc_average_over_datasets.py --temperature=0.1` and command `python plot_group_auroc_average_over_datasets.py --temperature=0.1` to generate Figure 2.
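
Steps 10 and 11 ask for three output folders: `results`, `paper_results` and `plots`. A minimal sketch for creating them in one go is shown below. The function name `make_output_dirs` is hypothetical, and creating the folders in the directory from which the scripts are run is an assumption, since the guide does not state where they should live.

```python
from pathlib import Path

# Folder names taken from steps 10(b) and 11(a)
OUTPUT_DIRS = ("results", "paper_results", "plots")


def make_output_dirs(base_dir="."):
    """Create the output folders under base_dir; existing folders are left as-is."""
    created = []
    for name in OUTPUT_DIRS:
        path = Path(base_dir) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_output_dirs()` from the repository folder before step 10(c) is safe to repeat, because `exist_ok=True` leaves existing folders and their contents untouched.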
