# Multi-LLM 

## Running `main_our_pipeline_{dataset}`

This guide explains how to prepare and run the pipeline for each dataset. Replace `{dataset}` with the name of the target dataset (e.g., `gsm_symbolic`, `gsm_plus`, `metamath`, `nasa_history`).

### 1) Choose parameters
- `threshold_weak` — decision threshold for the weak LLM.
- `threshold_strong` — decision threshold for the strong LLM.
- `threshold_similarity` — similarity-score cutoff used to decide whether two problems are “similar.” The current default is 0. Refining the criterion for selecting similar questions is left to future work.
- `similar_problems` — how many similar problems to retrieve per query in the pipeline.
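
To make the roles of `threshold_similarity` and `similar_problems` concrete, here is a minimal sketch of how the two might interact during retrieval. It is not the pipeline's actual code: `PipelineParams` and `select_similar` are hypothetical names, the threshold values are placeholders, and the sketch assumes that a higher score means "more similar" and that the cutoff is inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineParams:
    """Hypothetical container; the real pipeline may pass these differently."""
    threshold_weak: float
    threshold_strong: float
    threshold_similarity: float = 0.0  # default stated in this guide
    similar_problems: int = 3          # illustrative value, not a documented default


def select_similar(scored, params):
    """Return up to `similar_problems` candidates scoring at or above the cutoff.

    `scored` is a list of (problem_id, similarity_score) pairs.
    ASSUMPTION: higher score means more similar, and the cutoff is inclusive.
    """
    kept = [(pid, s) for pid, s in scored if s >= params.threshold_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[: params.similar_problems]


# Placeholder thresholds; real values come from the calibration procedure.
params = PipelineParams(threshold_weak=0.7, threshold_strong=0.9, similar_problems=2)
candidates = [("p1", 0.82), ("p2", -0.10), ("p3", 0.35), ("p4", 0.91)]
top = select_similar(candidates, params)
print(top)
```

With the default cutoff of 0, only candidates with a negative score are filtered out, so `similar_problems` is what mostly limits the result.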

### 2) Initialize the RAG (calibration-only, optional)
If you want the RAG to include **only** calibration strategies and exclude strategies from previous pipeline runs, do **one** of the following:
- **Delete** the previous results in `{prefix}_rag_database`, **or**
- **Change** the pipeline’s `data_dir` to point to a fresh directory before running.
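
Both options can be scripted. The sketch below uses only the Python standard library. `reset_rag_store` and `fresh_data_dir` are hypothetical helper names, `demo` stands in for `{prefix}`, and the code makes no assumption about whether the store is a file or a directory. The demo runs inside a temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def reset_rag_store(store: Path) -> None:
    """Option 1: remove previous results at `store`, whether file or directory."""
    if store.is_dir():
        shutil.rmtree(store)
    elif store.exists():
        store.unlink()


def fresh_data_dir(parent: Path, name: str) -> Path:
    """Option 2: create a new, empty directory to pass as `data_dir`."""
    new_dir = parent / name
    new_dir.mkdir(parents=True, exist_ok=False)  # fail rather than reuse old data
    return new_dir


# Demo in a temporary directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_store = root / "demo_rag_database"  # "demo" stands in for {prefix}
    old_store.mkdir()
    (old_store / "strategy.json").write_text("{}")

    reset_rag_store(old_store)
    store_removed = not old_store.exists()

    data_dir = fresh_data_dir(root, "run_calibration_only")
    data_dir_is_empty = not any(data_dir.iterdir())
```

Passing `exist_ok=False` makes the second option fail if the directory already exists, which guards against silently reusing strategies from an earlier run.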

---

## How to set each parameter

### `threshold_weak`
1. Run `main_single_weak_gpt3.5turbo_llm_{dataset}` with your desired calibration dataset and the prompt you will actually use.
2. Use the `find_threshold` function in `finetune_hyperparameter.ipynb` to compute the threshold.
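
For intuition only, the sketch below shows one way a threshold could be derived from a calibration run. It is not the `find_threshold` implementation from `finetune_hyperparameter.ipynb`, whose inputs and selection rule are not described here. It assumes the calibration run yields a confidence score and a correctness flag per example, and it picks the smallest cutoff at which accepted answers reach a target accuracy. The name `sketch_find_threshold`, the `target_accuracy` parameter, and the sample data are all illustrative.

```python
def sketch_find_threshold(confidences, correct, target_accuracy=0.9):
    """Smallest cutoff t such that examples with confidence >= t meet the target accuracy.

    ASSUMPTION: this selection rule is illustrative, not the repository's.
    Returns None if no cutoff reaches the target.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    for t in sorted(set(confidences)):
        # Keep the correctness flags of examples accepted at this cutoff.
        accepted = [c for conf, c in zip(confidences, correct) if conf >= t]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return t
    return None


# Made-up calibration results, for illustration only.
confidences = [0.2, 0.4, 0.6, 0.8, 0.9]
correct = [False, False, True, True, True]
threshold = sketch_find_threshold(confidences, correct, target_accuracy=0.9)
print(threshold)
```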

### `threshold_strong`
- Same procedure as `threshold_weak`, using the corresponding single-LLM run for the strong model.

## Notes
- Keep the calibration dataset and prompts consistent between threshold estimation and the pipeline for best results.
- To re-run with a different strategy store, point `data_dir` at a different directory or clean previous outputs.
