# MCTS framework for LLM agents

## Install requirements
```sh
python3.11 -m venv .venv  # Python 3.11 or later
source .venv/bin/activate

# Install Graphviz
brew install graphviz  # macOS
sudo apt-get install graphviz  # Linux (Debian/Ubuntu)
which dot  # Check that Graphviz is installed

# Install requirements
pip install -e '.[dev]'
```

### For SWE Bench
```sh
git submodule update --init --recursive
cd SWE-bench
pip install -e .
```

### For Qwen Math RM
```bash
pip install -e '.[dev,gpu]'
```

## API keys setup
Create a [.env](.env) file in the root of this repo and set the following keys.
See [.env.example](.env.example) for the expected format.

### Gemini
```
export GEMINI_API_KEY=<YOUR_API_KEY>
```

### OpenAI
```
export OPENAI_API_KEY=<YOUR_API_KEY>
```

### Claude
We use Bedrock for Claude Sonnet, which requires setting up an AWS access key ID and secret access key:
```
export AWS_ACCESS_KEY_ID=<access key id>
export AWS_SECRET_ACCESS_KEY=<secret access key>
unset AWS_SESSION_TOKEN
```

## Algorithms

### Standard MCTS
The standard MCTS algorithm. For the score function, we use the AlphaZero-style ([Science paper](https://www.science.org/doi/10.1126/science.aar6404)) UCT score:

$Q_i = \frac{W_i}{N_i}$ ($W_i$: `value_sum`, $N_i$: `visit_count`): Exploitation term

$P_i = \frac{\exp(s_i)}{\sum_{j} \exp(s_j)}$: Prior (Softmax score)

$U_i = Q_i + C \cdot P_i \cdot \frac{\sqrt{N_p}}{1 + N_i}$ ($N_p$: `visit_count` of the parent node, $C$: Hyperparameter): UCT score (the PUCT variant used in AlphaZero)
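As a rough illustration, the three formulas above can be combined as in the following sketch. The names `softmax` and `puct_scores`, the explicit `parent_visits` argument, and the choice of $Q_i = 0$ for unvisited nodes are assumptions made for this example; they are not taken from this repo's `UCTScore` implementation.

```python
import math


def softmax(raw_scores):
    """Numerically stable softmax: P_i = exp(s_i) / sum_j exp(s_j)."""
    shift = max(raw_scores)
    exps = [math.exp(s - shift) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def puct_scores(value_sums, visit_counts, raw_scores, parent_visits, c=1.0):
    """Compute U_i = Q_i + C * P_i * sqrt(N_p) / (1 + N_i) for sibling nodes."""
    priors = softmax(raw_scores)
    scores = []
    for w, n, p in zip(value_sums, visit_counts, priors):
        # Assumption: an unvisited node has Q_i = 0 (avoids division by zero).
        q = w / n if n > 0 else 0.0
        scores.append(q + c * p * math.sqrt(parent_visits) / (1 + n))
    return scores


# Two siblings with equal priors: the unvisited one gets the larger bonus.
print(puct_scores([1.0, 0.0], [2, 0], [0.0, 0.0], parent_visits=4))
```

With equal priors, the second (unvisited) node scores higher than the first, which shows how the exploration term favors rarely visited children.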

The number of generated nodes: `#initial_expand_samples` + `#num_simulations` * `#num_expand_samples` * `len(actions)`

```python
mcts_config = MCTSConfig(
    num_simulations=3,
    num_expand_samples=2,
    initial_expand_samples=4,
    actions=("answer",),
)
mcts_algo = build_algo("standard", config=mcts_config, score_func=UCTScore())
```
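For the configuration above, the node-count formula gives 4 + 3 * 2 * 1 = 10 generated nodes. The helper below is a hypothetical function written only to make that arithmetic explicit; it is not part of this package.

```python
def standard_mcts_node_count(
    initial_expand_samples, num_simulations, num_expand_samples, num_actions
):
    """Node count per the formula stated above for standard MCTS."""
    return initial_expand_samples + num_simulations * num_expand_samples * num_actions


# Values taken from the MCTSConfig example: actions=("answer",) has length 1.
print(standard_mcts_node_count(4, 3, 2, 1))  # 4 + 3 * 2 * 1 = 10
```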

### Linear UCB

The number of generated nodes: **1** + `#num_simulations`

```python
mcts_config = MCTSConfig(
    num_simulations=3,
    num_expand_samples=None,  # Has no effect
    initial_expand_samples=None,  # Has no effect
    actions=("answer",),  # Only a single action is allowed
)
mcts_algo = build_algo("linucb", config=mcts_config)
```

### Hierarchical Thompson Sampling

The number of generated nodes: **1** + `#num_simulations`

```python
mcts_config = MCTSConfig(
    num_simulations=3,
    num_expand_samples=None,  # Has no effect
    initial_expand_samples=None,  # Has no effect
    actions=("answer",),  # Only a single action is allowed
)
mcts_algo = build_algo("thompson", config=mcts_config)
```

### Best First Search
Expand new nodes from the best leaf node.

The number of generated nodes: `#initial_expand_samples` + `#num_simulations` * `#num_expand_samples` * `len(actions)`

```python
mcts_config = MCTSConfig(
    num_simulations=3,
    num_expand_samples=2,
    initial_expand_samples=4,
    actions=("answer",),
)
mcts_algo = build_algo("best-first-search", config=mcts_config)
```
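The idea of always expanding the best leaf can be sketched with a priority queue. This is a minimal illustration, not this repo's implementation: `best_first_search`, the `expand` callback, and the toy tree are all hypothetical, and the sketch counts the root expansion as one iteration, which may differ from how `initial_expand_samples` is handled here.

```python
import heapq
import itertools


def best_first_search(root, expand, num_iterations):
    """Repeatedly pop the highest-scoring leaf and expand it.

    `expand(node)` is a hypothetical callback returning (child, score) pairs.
    Returns all generated (child, score) pairs in generation order.
    """
    tie_breaker = itertools.count()  # keeps heap entries comparable on ties
    # heapq is a min-heap, so scores are negated to pop the best leaf first.
    leaves = [(0.0, next(tie_breaker), root)]
    generated = []
    for _ in range(num_iterations):
        if not leaves:
            break
        _neg_score, _order, node = heapq.heappop(leaves)
        for child, score in expand(node):
            generated.append((child, score))
            heapq.heappush(leaves, (-score, next(tie_breaker), child))
    return generated


# Toy tree with fixed scores, used only for demonstration.
toy_tree = {
    "root": [("a", 0.9), ("b", 0.2)],
    "a": [("a1", 0.1), ("a2", 0.5)],
    "a2": [("a2x", 0.3)],
}
result = best_first_search("root", lambda n: toy_tree.get(n, []), num_iterations=3)
print([name for name, _ in result])
```

In the toy run, the search expands `root`, then `a` (score 0.9), then `a2` (score 0.5), skipping the lower-scoring leaves `b` and `a1`.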

### Tree-of-Thoughts (ToT) BFS
Original Paper: [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://proceedings.neurips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html) ([arXiv version](https://arxiv.org/abs/2305.10601))

The number of generated nodes: `#initial_expand_samples` + (`#num_simulations` - 1) * (`#num_expand_samples` * `#initial_expand_samples`)

Note: The number of nodes becomes smaller than this when `#initial_expand_samples` < `#num_expand_samples`

```python
mcts_config = MCTSConfig(
    num_simulations=3,  # Corresponds to step limit `T`
    num_expand_samples=2,  # Corresponds to breadth limit `b`
    initial_expand_samples=4,  # Corresponds to size limit `k` (>= `b`)
    actions=("answer",),  # Only a single action is allowed
)
mcts_algo = build_algo("tot-bfs", config=mcts_config)
```
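The following sketch shows one way the ToT BFS loop from the paper can be written, and why the node count shrinks when the size limit is smaller than the breadth limit. It is an assumption-laden illustration: `tot_bfs`, `toy_generate`, and the tuple-based states are hypothetical and do not reflect this repo's internals.

```python
def tot_bfs(root, generate, evaluate, step_limit, breadth_limit, size_limit):
    """Return every generated state, in generation order.

    `generate(state, n)` proposes n candidate states; `evaluate(state)` scores one.
    Both are hypothetical callbacks standing in for LLM calls.
    """
    generated = []
    frontier = [root]
    for _ in range(step_limit):
        # Each kept state proposes `size_limit` candidates.
        candidates = [c for state in frontier for c in generate(state, size_limit)]
        generated.extend(candidates)
        # Keep only the `breadth_limit` best candidates for the next step.
        frontier = sorted(candidates, key=evaluate, reverse=True)[:breadth_limit]
    return generated


def toy_generate(state, n):
    """Toy generator: extend a tuple state with each of 0..n-1."""
    return [state + (i,) for i in range(n)]


# T=3, b=2, k=4 as in the config above: 4 + (3 - 1) * (2 * 4) = 20 nodes.
print(len(tot_bfs((), toy_generate, sum, step_limit=3, breadth_limit=2, size_limit=4)))
# With k=2 < b=4, the first step yields only 2 states to keep, so fewer nodes.
print(len(tot_bfs((), toy_generate, sum, step_limit=3, breadth_limit=4, size_limit=2)))
```

In the second call the formula would give 2 + 2 * (4 * 2) = 18, but only 2 + 4 + 8 = 14 nodes are generated because the frontier cannot be filled to the breadth limit after the first step.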

### Multi Armed Bandit UCB
A simple algorithm for the multi-armed bandit problem, where the arms are LLMs. We use a UCB-based algorithm.

UCB Score: $U_i = Q_i + E_i$

$Q_i = \frac{W_i}{N_i}$ ($W_i$: `value_sum`, $N_i$: `visit_count`): Exploitation term

$E_i = \sqrt{\frac{2 \ln{N_p}}{N_i}}$ ($N_p$: `visit_count` of the parent node): Exploration term
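The UCB score above can be sketched as follows. The function names `ucb_score` and `select_arm`, the arm labels, and the convention of returning infinity for unvisited arms are assumptions for this example, not this repo's API.

```python
import math


def ucb_score(value_sum, visit_count, parent_visit_count):
    """U_i = W_i / N_i + sqrt(2 * ln(N_p) / N_i)."""
    if visit_count == 0:
        # Assumption: unvisited arms are tried first.
        return math.inf
    exploitation = value_sum / visit_count
    exploration = math.sqrt(2 * math.log(parent_visit_count) / visit_count)
    return exploitation + exploration


def select_arm(arms, parent_visit_count):
    """Pick the arm name with the highest UCB score.

    `arms` maps a name to a (value_sum, visit_count) pair.
    """
    return max(
        arms, key=lambda name: ucb_score(*arms[name], parent_visit_count)
    )


# Hypothetical arms: model_b has fewer visits, so its exploration bonus is larger.
arms = {"model_a": (3.0, 4), "model_b": (1.0, 1)}
print(select_arm(arms, parent_visit_count=5))
```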

The number of generated nodes: `#num_simulations`

```python
mcts_config = MCTSConfig(
    num_simulations=16,
    num_expand_samples=None,  # Has no effect
    initial_expand_samples=None,  # Has no effect
    actions=("answer",),  # Only a single action is allowed
)
mcts_algo = build_algo("mab-ucb", config=mcts_config)
```
