---

## 🧪 BBH (BIG-Bench Hard) Experiment Repository

### 📘 Project Description

This project explores the performance of large language models on the **BIG-Bench Hard (BBH)** benchmark, with a focus on **Chain-of-Thought (CoT)** generation strategies. The codebase supports end-to-end processing, including data loading, generation, path probability analysis, strategy selection, and evaluation.

---

### 🚀 Key Features

1. **Data Loading**
   Load task data from the BBH dataset, which covers tasks such as mathematical reasoning and logical deduction.

2. **Model Generation**
   Generate candidate answers using pretrained LLMs with multiple decoding strategies such as Top-k sampling and Fibonacci sampling.

3. **Path Probability Analysis**
   Analyze token- and word-level probabilities of generated paths using multiple scoring methods (e.g., logits, gap, entropy).

4. **Strategy Selection**
   Select the final answer with strategies such as Top-1, Max Path, and Aggregated Path.

5. **Evaluation Metrics**
   Compute evaluation scores including BLEU, ROUGE, and MATCH for measuring answer quality.
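
The scoring and selection steps in items 3 and 4 can be illustrated with a short sketch. This is not the repository's implementation: the function names (`gap_score`, `mean_entropy`, `select_max_path`, `select_aggregated_path`) and the exact definitions are assumptions made for illustration. Here the gap score is taken to be the mean difference between the two most probable tokens at each step, Max Path returns the answer of the single highest-scoring path, and Aggregated Path sums the scores of all paths that share an answer.

```python
import math
from collections import defaultdict


def gap_score(step_probs):
    """Mean difference between the top-1 and top-2 probability at each step.

    step_probs: one list of candidate-token probabilities per decoding step
    (hypothetical input format; each list needs at least two entries).
    """
    gaps = []
    for probs in step_probs:
        top1, top2 = sorted(probs, reverse=True)[:2]
        gaps.append(top1 - top2)
    return sum(gaps) / len(gaps)


def mean_entropy(step_probs):
    """Mean Shannon entropy (in nats) of the per-step distributions."""
    total = 0.0
    for probs in step_probs:
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / len(step_probs)


def select_max_path(paths):
    """Return the answer of the single path with the highest score."""
    return max(paths, key=lambda pair: pair[1])[0]


def select_aggregated_path(paths):
    """Sum scores over paths with the same answer; return the best answer."""
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy example: three decoded paths as (extracted answer, path score) pairs.
paths = [("4", 0.55), ("5", 0.60), ("4", 0.50)]
print(select_max_path(paths))         # "5": the single best path
print(select_aggregated_path(paths))  # "4": 0.55 + 0.50 outweighs 0.60
```

The toy example shows why the two strategies can disagree: one confident but isolated path wins under Max Path, while several moderately confident paths that agree win under aggregation.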

---

### 🗂 Directory Structure

```bash
project_root/
├── gcot/
│   ├── bbh_loader.py               # Data loader for BBH tasks
│   ├── config_bbh.py               # Configuration: model path, task list, decoding strategies, etc.
│   ├── decode_bbh.py               # Core decoding logic and probability computation
│   ├── extract_CoT_bbh.py          # Answer selection strategies and CoT extraction
│   ├── tools_bbh.py                # Utility functions (e.g., answer extraction, filtering)
│   └── bbh_experiment_main.py      # Main entry point for experiments
│
└── data/
    └── BIG-Bench-Hard/
        └── bbh/                    # Place all BBH task JSONs here
```
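
To make the expected data layout concrete, here is a minimal sketch of reading one task file from that directory. It is independent of `bbh_loader.py`: the helper name `load_bbh_task` is hypothetical, and the code assumes each task file holds an `examples` list of `input`/`target` records, so check this against your copy of the dataset.

```python
import json
from pathlib import Path


def load_bbh_task(data_path, task_name):
    """Return the examples stored in <data_path>/bbh/<task_name>.json.

    Assumes the file looks like
    {"examples": [{"input": "...", "target": "..."}, ...]}.
    """
    task_file = Path(data_path) / "bbh" / f"{task_name}.json"
    with task_file.open(encoding="utf-8") as f:
        payload = json.load(f)
    return payload["examples"]
```

Under these assumptions, calling `load_bbh_task('../data/BIG-Bench-Hard', 'multistep_arithmetic_two')` from inside `gcot/` would return that task's examples.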

---

### ⚙️ Installation

Install dependencies using pip:

```bash
pip install torch transformers nltk rouge-score sentence-transformers tqdm
```

---

### ⚙️ Configuration

1. Open `gcot/config_bbh.py`.
2. Set the following values for your environment:

```python
MODEL_DIR = '../your_local_model_path'
RUNNING_MODEL = 'Gemma-7B'  # or 'Qwen2.5-32B', etc.
DATA_PATH = '../data/BIG-Bench-Hard'
```

---

### ▶️ How to Run

Run the main experiment script:

```bash
cd gcot/
python bbh_experiment_main.py
```

This runs the selected tasks (see `RUNNING_TASK` in `config_bbh.py`), evaluates the answers under each selection strategy, and both prints and saves the intermediate and final evaluation results.

---

### 📊 Output

Evaluation results will be saved to:

```bash
result/bbh/
```

and will include:

* Strategy-specific BLEU, ROUGE, and MATCH scores
* Intermediate CoT generations and path probability logs
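
This README does not define MATCH. As a rough illustration only, the sketch below assumes it denotes exact match after light normalization (lowercasing, trimming whitespace, dropping trailing periods). The names `normalize` and `match_score` are hypothetical and the repository's scoring may differ.

```python
def normalize(text):
    """Lowercase, trim whitespace, and drop trailing periods."""
    return text.strip().lower().rstrip(".")


def match_score(predictions, targets):
    """Fraction of predictions equal to their target after normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Two of the three predictions match their targets after normalization.
print(match_score(["4.", "(B)", "no"], ["4", "(b)", "yes"]))
```

The normalization is deliberately conservative: it keeps leading signs and parentheses so that answers such as `-5` and `(A)` are compared as written.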

---

### 🔍 Example Usage

A minimal example that loads a task and generates answers (run it from inside `gcot/` so the imports resolve):

```python
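from transformers import AutoModelForCausalLM, AutoTokenizer

from bbh_loader import BBHDataLoader
from decode_bbh import get_token_k_path_prob_follow_up

# Load the model and tokenizer (assumption: a Hugging Face checkpoint
# directory; point this at your own model path)
model_path = '../your_local_model_path'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Load a task
loader = BBHDataLoader()
task = loader.get_by_json_name('multistep_arithmetic_two')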

# Run decoding
k_responses, gen_probs, follow_probs = get_token_k_path_prob_follow_up(
    model, tokenizer, query="What is 2 + 2?", k=5
)

# Print responses
for res in k_responses:
    print(res)
```

---

