# Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Delibration

> Claim: This code does **NOT** disclose any sensitive or personally identifiable information.

## ⚙️ Installation

1. **Create a Virtual Environment**

    ```bash
    conda create -n specbench python=3.10
    conda activate specbench
    ```

2. **Install Dependencies (optional):**

    ```bash
    # Optional: required only if you plan to use vLLM server
    # If you are relying solely on external APIs, you can skip installing vLLM
    pip install vllm==0.8.5.post1 
    ```

3. **Install specbench:**
   
   ```bash
   pip install -e .
   ```

## 🛠️ Detailed Pipeline

By default, the SpecBench framework runs in a CPU‑only environment. You may run models through a vLLM server or through an external OpenAI‑compatible API.

### Dataset

All data lives in the `data` folder, organized into the scenarios `Biochem`, `Child`, `Code`, `Health`, and `Travel`.

### LLM Setup

#### Option 1: vLLM Server API

You can host either the generation model or the evaluator model on a vLLM server. For example, the evaluator `Qwen3-32B-thinking` corresponds to a vLLM deployment of the base model `Qwen/Qwen3-32B`. To start a server:

```bash
export GPU_CNT=4
export PORT=8001
vllm serve Qwen/Qwen3-32B --dtype auto --tensor-parallel-size $GPU_CNT \
    --api-key your-token-abc123 --port $PORT --gpu-memory-utilization 0.9
```

`--api-key` sets the token that clients will send as the API key. Adjust `GPU_CNT` to match available GPUs and set `PORT` to an open port.

#### Option 2: External API

You can also use an external OpenAI‑compatible API, such as the OpenAI Platform. To use `GPT-4.1` as the evaluator, set the environment variable and run as usual:

```bash
export OPENAI_API_KEY=[YOUR_API_KEY]
```

### Generation

After setting up the API server, you can generate responses from the specified model using a chosen TTD method. Demo scripts live in `scripts/generation/`. The following example corresponds to `scripts/generation/Qwen3-14B-thinking-align3.sh`.

```bash
python src/run.py \
    --mode generation \
    --scenario Child \
    --model Qwen3-14B-thinking \
    --provider vllm-server \
    --method align3 \
    --num_threads 32 \
    --ip $ip \
    --port $port
```

Parameter explanations are as follows.

* `mode`: Either `generation` or `evaluation`.
* `scenario`: Scenario identifier. The loader reads files from `data/$scenario/`.
* `model`: Logical model name used by SpecBench, such as `Qwen3-14B`, `Qwen3-14B-thinking`, or `GPT-4.1`. This value must match a key in `config/config_model.yaml`, where decoding and request parameters are defined.
* `provider`: The API provider used to call the model. Common values are `vllm-server`, `openai`, and `deepseek-ai`.
* `method`: The TTD method used during inference. Supported options include `vanilla`, `zero_think`, `more_think`, `self-refine`, and `align3`, which correspond to the method names described in our paper. Some methods require additional hyperparameters, which you can set in the config files referenced in the `config/README.md`.
* `num_threads`: The number of concurrent worker threads used to submit requests. Increase this value to speed up runs if your server and rate limits allow it.
* `ip`: The hostname or IP address of the vLLM server, for example `localhost` or `127.0.0.1`. This parameter is used only when `provider` is `vllm-server`.
* `port`: The port of the vLLM server, for example `8001`. This parameter is used only when `provider` is `vllm-server`.

Outputs of a generation run are written to `result/$model/$scenario/$method/`:

* `result/$model/$scenario/$method/generate.json`: The model responses that will be consumed by the evaluator. Each entry corresponds to one input item in the scenario and includes the model output.
* `result/$model/$scenario/$method/generate_usage.json`: Aggregated token usage for the run. Use this file to understand cost and throughput for your settings.

For more details on the generation flow, see `src/run.py` and `src/generate_response.py`.

### Evaluation

After completing the generation step, you can proceed to evaluate the produced responses. Demo scripts are provided in `scripts/evaluation/`. The example below corresponds to `scripts/evaluation/Qwen3-14B-thinking-align3.sh`.

```bash
python src/run.py \
    --mode evaluation \
    --scenario Child \
    --model Qwen3-14B-thinking \
    --eval_model GPT-4.1 \
    --eval_model_provider openai \
    --evaluation_type joint \
    --method align3 \
    --num_threads 32 \
    --ip $ip \
    --port $port
```

Most parameters are the same as those in the generation section. The following parameters differ during evaluation.

* `model`: The identifier of the model whose outputs you want to evaluate, such as `Qwen3-14B-thinking` or `DeepSeek-R1-Distill-Llama-8B`. The evaluator does not load this model. This value is only used to locate `result/$model/$scenario/$method/generate.json` produced during generation.
* `eval_model`: The evaluator model, such as `GPT-4.1` or `Qwen3-32B-thinking`. In most experiments, `GPT-4.1` is used as the evaluator. However, during development we also recommend using `Qwen3-32B-thinking` as a lower-cost, locally deployable alternative.
* `eval_model_provider`: The provider used to call `eval_model`, for example `openai` or `vllm-server`.
* `evaluation_type`: Evaluation type, where `joint` evaluates all specifications in one LLM response, while `sequential` scores each specification individually across multiple responses; default is `joint`.
* `ip` and `port`: Used only if the evaluator model is hosted on a vLLM server.

Evaluation outputs are written to `result/$model/$scenario/$method/`.

* `result/$model/$scenario/$method/"$eval_model"_evaluate.json`: Per‑sample evaluation records that include the evaluator’s judgments for each specification.
* `result/$model/$scenario/$method/"$eval_model"_score.json`: Summary metrics for the run. The file reports scores on the unsafe subset, the safe subset, and the full dataset. The summary includes safety scores, behavioral scores, and SAR.

For more details on evaluation logic and scoring, see `src/run.py` and `src/evaluate_response.py`.

### Multi‑Machine Multi‑GPU Setup

If deploying the vLLM server on multiple machines with multiple GPUs, ensure you obtain the IP address of the vLLM server and use it in the `--ip` parameter. This allows the script to generate or evaluate by querying the vLLM server running on a different machine. Ensure that both machines are connected to the same network and the server is accessible via the specified IP.

### Other External APIs

The file `specbench/utils.py` provides connectors for vLLM, OpenAI, and DeepSeek. To integrate another OpenAI‑compatible service, set two fields in the `OpenAIEngine` class.

* `BASE_URL`: The base URL of the API server.
* `API_KEY_STRING`: The name of the environment variable that stores your API key. The engine calls `os.getenv(self.API_KEY_STRING)` to read the key.

After setting these fields, point the `provider` and model name to the new service in your command line arguments and ensure that the environment variable is present in your shell.
