## Benchmarking (Unified PhysLogic)

This folder provides a benchmarking pipeline for the PhysLogic benchmark.

### Requirements
- Python 3.9+
- `sentence-transformers`, `scikit-learn`, `python-dotenv`, `openai`, and (optional) `volcenginesdkarkruntime` if using the Ark batch judger.

Install:
```bash
pip install -r requirements.txt
```

### Environment
Set your API keys and endpoints via environment variables:
- `OPENAI_API_KEY`: API key for the official OpenAI endpoint
- `OPENAI_BASE_URL` (optional): custom base URL if needed
- `DEEPSEEK_API_KEY`, `DEEPSEEK_BASE_URL`, `DEEPSEEK_MODEL`: for DeepSeek batch judging through its official OpenAI-compatible API
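
Before a long run, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch, not part of the pipeline: the helper `missing_env_vars` and the grouping into `REQUIRED_VARS` and `DEEPSEEK_VARS` are hypothetical, and which variables are mandatory for a given run is an assumption.

```python
import os

# Variable names come from the list above. Treating OPENAI_API_KEY as the only
# always-required one is an assumption of this sketch.
REQUIRED_VARS = ["OPENAI_API_KEY"]
DEEPSEEK_VARS = ["DEEPSEEK_API_KEY", "DEEPSEEK_BASE_URL", "DEEPSEEK_MODEL"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS + DEEPSEEK_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All listed environment variables are set.")
```

If the keys live in a `.env` file, loading it with `python-dotenv` before this check would populate `os.environ` first.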

### Run Benchmarking
```bash
python benchmarking.py --model_id your_model --batch_size 12
```
Outputs are written to `results/<model_id>/{choice,comp_n,comp_e,proof}.json`.

Notes:
- Accuracy (score) is computed only for `choice` and `comp_n`.
- Logic metrics (P, O, F, recall, precision) are computed for all question types.
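
The output layout described above can be expressed as a small path helper, which is handy for checking whether a run produced all four files. This is a sketch under assumptions: `result_paths` is a hypothetical helper, and only the directory layout `results/<model_id>/<type>.json` is taken from this README; the contents of the JSON files are not assumed.

```python
from pathlib import Path

# The four question types named in this README.
QUESTION_TYPES = ("choice", "comp_n", "comp_e", "proof")


def result_paths(model_id, root="results"):
    """Map each question type to the file the benchmark run is expected to write."""
    return {qt: Path(root) / model_id / f"{qt}.json" for qt in QUESTION_TYPES}


if __name__ == "__main__":
    for qt, path in result_paths("your_model").items():
        status = "found" if path.exists() else "missing"
        print(f"{qt}: {path} ({status})")
```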

### Aggregate Results
```bash
python eval_result.py --model_id your_model
```
This prints per-type metrics and overall averages.

### Data Schema
`Datas/PhysLogic.json` entries must include:
- `question`: problem text
- `final_answer`: short final answer (e.g., choice A/B/C/D)
- `logical_nexuses`: list of logical nexus steps
- `logical_nexus_weights`: list of weights matching `logical_nexuses`
- `question_type`: one of `choice`, `comp_n`, `comp_e`, `proof`
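
The required fields above can be checked with a short validator before running the benchmark. This is a minimal sketch rather than the pipeline's own loader: `validate_entry` is a hypothetical helper, the sample entry is invented for illustration, and reading "weights matching `logical_nexuses`" as "same length" is an assumption.

```python
REQUIRED_FIELDS = (
    "question",
    "final_answer",
    "logical_nexuses",
    "logical_nexus_weights",
    "question_type",
)
QUESTION_TYPES = {"choice", "comp_n", "comp_e", "proof"}


def validate_entry(entry):
    """Return a list of problems found in one entry; an empty list means it passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        return problems

    if entry["question_type"] not in QUESTION_TYPES:
        problems.append(f"unknown question_type: {entry['question_type']!r}")

    nexuses = entry["logical_nexuses"]
    weights = entry["logical_nexus_weights"]
    if not isinstance(nexuses, list) or not isinstance(weights, list):
        problems.append("logical_nexuses and logical_nexus_weights must be lists")
    elif len(nexuses) != len(weights):
        # Assumption: "matching" means one weight per nexus step.
        problems.append("logical_nexus_weights length differs from logical_nexuses")
    return problems


# Invented sample entry, for illustration only.
sample = {
    "question": "A ball is dropped from rest. Which quantity stays constant?",
    "final_answer": "B",
    "logical_nexuses": ["Identify the forces", "Apply Newton's second law"],
    "logical_nexus_weights": [0.5, 0.5],
    "question_type": "choice",
}
print(validate_entry(sample))
```

To check the whole dataset, load `Datas/PhysLogic.json` with `json.load` and call `validate_entry` on each item, assuming the file holds a list of entries.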

### Judge Prompt
The numerical/textual judge prompt for `comp_n` lives at `prompt/LLM_judge.md`.
