# **Calibrated Decision-Making through Large Language Model-Assisted Retrieval**

This repository contains the implementation of **Calibrated Decision-Making through Large Language Model-Assisted Retrieval (CalibRAG)**. CalibRAG aims to improve decision-making accuracy by pairing Retrieval-Augmented Generation (RAG) with calibrated uncertainty estimates for the generated outputs.

---

## **Synthetic Data Generation**
*(Note: You may skip this step as the dataset is pre-generated and provided at `./data_final`)*

---

## **Baseline Methods**

1. **Generate LLM Outputs (Sampling)**
   Sample outputs from the large language model.
   ```bash
   python -m experiments.make_lm_outputs --dataset dev
   ```

2. **Evaluate Generated Outputs**
   Assess the quality of outputs generated by the model.
   ```bash
   python -m experiments.api --data_dir <data must have columns x, y, y_pred> --multiple False --type eval
   ```
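
The `--data_dir` placeholders above name the columns each step expects (`x`, `y`, `y_pred`). A quick pre-flight check can catch a missing column before a long run. The sketch below is a minimal, hypothetical helper, not part of this repository: it assumes, purely for illustration, that the data is a single CSV file with a header row, which may not match the format the scripts actually read.

```python
import csv
from pathlib import Path


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header.

    Assumption: the data is a CSV file whose first row is the header.
    The order of `required` is preserved in the result.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        # Read only the header row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

Calling `missing_columns(path, ["x", "y", "y_pred"])` returns an empty list when the file is ready for the evaluation step, and the names of the absent columns otherwise.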

---

## **CalibRAG Workflow**

1. **Create Open-ended Questions**
   Formulate open-ended questions to encourage comprehensive responses.
   ```bash
   python -m experiments.api --data_dir <data must have columns x> --type oe
   ```

2. **Generate RAG Data**
   Create datasets suitable for Retrieval-Augmented Generation (RAG).
   ```bash
   python -m experiments.retrieve --dataset dev 
   ```

3. **Generate LLM Outputs with Uncertainty Calibration (UC)**
   Generate responses from the LLM, accompanied by uncertainty estimates.
   ```bash
   python -m experiments.make_lm_outputs --model_name="Meta-Llama-3.1-8B-Instruct" --batch_size=32 --uc_type="calibrag" --max_new_tokens=40 --dataset="dev" --inference False
   ```

4. **Simulate Human (LLM) Decision-Making**
   Use an LLM as a surrogate for a human decision-maker that makes decisions based on the LLM-generated content.
   ```bash
   python -m experiments.make_decision --data_dir <data must have columns x, z_pred>
   ```

5. **Evaluate Results**
   Evaluate the synthetic data and the accuracy of the simulated decisions.
   ```bash
   python -m experiments.api --data_dir <data must have columns x, y, y_pred> --multiple False --type eval
   ```
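
The workflow above produces decisions together with uncertainty estimates. One common way to summarize how well confidences match correctness is the expected calibration error (ECE). The sketch below is a generic equal-width-bin ECE written for illustration; it is an assumption that a metric of this kind is relevant here, and it is not necessarily what `experiments.api --type eval` computes. The function name and arguments are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: floats in [0, 1], one per prediction.
    correct:     truthy/falsy values, one per prediction.
    Returns the bin-size-weighted mean of |accuracy - mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Bins are [k/n, (k+1)/n); clamp so conf == 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, 1.0 if hit else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A perfectly calibrated set of predictions gives 0; larger values mean the stated confidence and the observed accuracy diverge more.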

---

## **Training Methods**

1. **CT-LoRA** 
   Calibration tuning with low-rank adaptation (LoRA).
   ```bash
   python -m experiments.train.train_calibration_tune --model_name="Meta-Llama-3.1-8B-Instruct" --batch_size 2 --gradient_accumulation_steps 2 --uc_type "ct"
   ```

2. **CT-Probe**
   Calibration tuning with a classifier probe.
   ```bash
   python -m experiments.train.train_classifier_tune --model_name "Meta-Llama-3.1-8B-Instruct" --batch_size 4 
   ```

3. **CT-Ling (Sampling)**
   Leverage linguistic-based calibration through sampling strategies.
   ```bash
   python -m experiments.train.train_calibration_tune --model_name="Meta-Llama-3.1-8B-Instruct" --batch_size 2 --gradient_accumulation_steps 2 --uc_type ling 
   ```

4. **CT-Number (Sampling)**
   Implement numerical calibration using sampling techniques.
   ```bash
   python -m experiments.train.train_calibration_tune --model_name="Meta-Llama-3.1-8B-Instruct" --batch_size 2 --gradient_accumulation_steps 2 --uc_type number 
   ```

5. **CalibRAG Training**
   Train the CalibRAG model to improve retrieval effectiveness and uncertainty calibration.
   ```bash
   python -m experiments.train.train_reranking_model --model_name="Meta-Llama-3.1-8B-Instruct" --batch_size 2 --gradient_accumulation_steps 2 --with_lora True
   ```

---

## **Test Data Generation**

1. **Create Open-ended Questions**
   Craft open-ended questions that promote exploratory and detailed answers.
   ```bash
   python -m experiments.api --data_dir <data must have columns x> --type oe
   ```

2. **Generate RAG Data**
   Prepare the datasets for Retrieval-Augmented Generation (RAG).
   ```bash
   python -m experiments.retrieve --dataset test
   ```

3. **Produce LLM Outputs with Uncertainty Calibration (UC)**
   Generate model responses alongside uncertainty estimates, enabling robust decision-making.
   ```bash
   python -m experiments.make_lm_outputs --model_name="Meta-Llama-3.1-8B-Instruct" --batch_size=32 --uc_type=<method type: calibrag, ct, ling, number> --max_new_tokens=40 --dataset="test" --query_peft_dir=<your trained model dir> --inference True --with_classifier <True if ct-probe and calibrag>
   ```
   *Optional*: If you wish to regenerate queries, use `regenerate.py`.

4. **Simulate Human (LLM) Decision-Making for Testing**
   Replicate decision-making in an inference setting.
   ```bash
   python -m experiments.make_decision --data_dir <data must have columns x, z_pred> --inference True --uc_type <method type: ct (ct-probe, ct-lora, calibrag), ling, number>
   ```

5. **Evaluate Results**
   Perform a comprehensive analysis of outputs and decision-making on the test dataset.
   ```bash
   python -m experiments.api --data_dir <data must have columns x, y, y_pred> --multiple False --type eval --uc_type <method type: ct (ct-probe, ct-lora, calibrag), ling, number>
   ```
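
The test-time commands above take different `--uc_type` values depending on the step: generation accepts `calibrag`, `ct`, `ling`, or `number`, while decision-making and evaluation group `ct-probe`, `ct-lora`, and `calibrag` under `ct`. The sketch below collects that mapping in one place. It is a hypothetical helper, not part of the repository, and it assumes that both CT variants use `--uc_type ct` at generation time and that `--with_classifier` is `False` for every method other than `ct-probe` and `calibrag`.

```python
METHODS = ("ct-lora", "ct-probe", "calibrag", "ling", "number")


def flags_for(method):
    """Map a method name to the flag values used in the test-time steps."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    is_ct_variant = method in ("ct-lora", "ct-probe")
    return {
        # --uc_type for experiments.make_lm_outputs
        # (assumption: both CT variants pass "ct" here)
        "generation_uc_type": "ct" if is_ct_variant else method,
        # --with_classifier: True for ct-probe and calibrag
        "with_classifier": method in ("ct-probe", "calibrag"),
        # --uc_type for experiments.make_decision and experiments.api
        "decision_uc_type": "ct" if is_ct_variant or method == "calibrag" else method,
    }


def decision_command(method, data_dir):
    """Build the argument list for the test-time decision step."""
    return [
        "python", "-m", "experiments.make_decision",
        "--data_dir", data_dir,
        "--inference", "True",
        "--uc_type", flags_for(method)["decision_uc_type"],
    ]
```

The list returned by `decision_command` can be passed to `subprocess.run` or joined with `shlex.join` for logging; the `data_dir` value is whatever path holds the outputs from step 3.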