# 📘 SciDA: A Multidisciplinary Benchmark for Numerical Reasoning

---

## 🔍 Abstract

Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems more effectively. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing benchmarks either risk data contamination or cover too few disciplines.  

Specifically, because LLM training data and static benchmarks draw on overlapping sources, models can inadvertently memorize answer keys or the numerical patterns of answers (i.e., data contamination). This leads to systematic overestimation of their reasoning capabilities, especially numerical reasoning.  

We propose **SciDA**, a multidisciplinary benchmark that consists exclusively of over **1k Olympic-level numerical computation problems** and supports **randomized numerical initializations** for each inference round, so that models cannot rely on fixed numerical patterns.  

We conduct a series of experiments with top-performing closed-source and open-source LLMs and observe that their performance drops significantly under random numerical initialization. SciDA thus provides a truthful and unbiased assessment of the numerical reasoning capabilities of LLMs.

---

## 🚀 Features
- 📊 **1k+ Olympic-level problems** across disciplines  
- 🎲 **Randomized numerical initialization** for each inference  
- 🔍 **Unbiased evaluation** of LLMs’ numerical reasoning  
- ⚡ Ready-to-use scripts for inference and evaluation  
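
To illustrate what randomized numerical initialization could look like, here is a minimal sketch. It is not the SciDA implementation: the template fields (`question`, `ranges`), the example physics problem, and the helper names `solve` and `instantiate` are all hypothetical and chosen only for illustration. The actual schema of `SciDA_v1.jsonl` is not described in this README.

```python
import random

# Hypothetical problem template: the field names below are illustrative,
# not the actual schema of SciDA_v1.jsonl.
TEMPLATE = {
    "question": "A body of mass {m} kg accelerates at {a} m/s^2. "
                "Find the net force in newtons.",
    "ranges": {"m": (1.0, 10.0), "a": (0.5, 5.0)},
}


def solve(values):
    """Reference solution for the template above: F = m * a."""
    return values["m"] * values["a"]


def instantiate(template, rng):
    """Draw one value per variable and return the filled question with its answer."""
    values = {
        name: round(rng.uniform(low, high), 2)
        for name, (low, high) in template["ranges"].items()
    }
    return template["question"].format(**values), solve(values)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make this demo repeatable
    for _ in range(2):
        question, answer = instantiate(TEMPLATE, rng)
        print(question, "->", round(answer, 4))
```

Because the numbers are drawn fresh for every round, a model that has memorized the answer to one fixed instance gains nothing on the next one; it has to carry out the computation.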

---

## 🛠️ Environment Setup

```bash
# Create environment
conda create -n EncycloBench python=3.10
conda activate EncycloBench

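# Install dependencies (the file name is spelled "enviroment.yaml" in the repository)
pip install -r enviroment.yaml
# If enviroment.yaml turns out to be a conda environment file rather than a
# pip requirements list, try instead:
#   conda env update -n EncycloBench -f enviroment.yaml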
```

---

## 🤖 Run Inference

1. Modify your `base_url` and `api_key` in `./infer/infer.py`:

   ```python
   base_url = "YOUR BASE URL"
   api_key = "YOUR API KEY"
   ```
2. Update the model and input file path.
3. Run inference:

   ```bash
   python ./infer/infer.py
   ```

👉 The output path for the randomly generated questions can be set via the `save_path` argument.
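
If you would rather not hard-code credentials in `./infer/infer.py`, one option is to read them from environment variables. The sketch below is a suggestion, not part of the repository: the function `load_api_config` and the variable names `SCIDA_BASE_URL` and `SCIDA_API_KEY` are hypothetical.

```python
import os


def load_api_config(env=os.environ):
    """Read endpoint settings from the environment (hypothetical variable names)."""
    base_url = env.get("SCIDA_BASE_URL", "")
    api_key = env.get("SCIDA_API_KEY", "")
    if not base_url or not api_key:
        raise RuntimeError(
            "Set SCIDA_BASE_URL and SCIDA_API_KEY before running inference."
        )
    return base_url, api_key
```

The returned pair could then be assigned to `base_url` and `api_key` in place of the placeholder strings shown in step 1, which keeps the key out of version control.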

---

## 📈 Run Evaluation

```bash
python ./eval/eval.py
```
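
This README does not describe how `eval.py` scores answers, so the following is only a sketch of one common way to grade numerical answers: comparison within a relative tolerance. The function names `is_correct` and `accuracy` and the 1% default tolerance are assumptions made for illustration; the repository's evaluation may work differently (for example, via the templates in `prompts/eval/`).

```python
import math


def is_correct(predicted, reference, rel_tol=1e-2):
    """Return True when predicted is within rel_tol (relative) of reference."""
    try:
        value = float(predicted)
    except (TypeError, ValueError):
        return False  # unparseable model output counts as wrong
    return math.isclose(value, reference, rel_tol=rel_tol, abs_tol=1e-9)


def accuracy(pairs, rel_tol=1e-2):
    """Fraction of (predicted, reference) pairs judged correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, r, rel_tol) for p, r in pairs) / len(pairs)
```

A tolerance-based check matters with randomized inputs, since the reference answer changes every round and exact string matching against a fixed key is no longer possible.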

---

## 📂 Project Structure

```
EncycloBench/
│── datasets/                 # SciDA datasets
│   └── SciDA_v1.jsonl
│
│── eval/                     # Evaluation scripts
│   ├── utils/
│   ├── eval.py
│   └── get_question_result.py
│
│── infer/                    # Inference scripts
│   ├── config/
│   ├── utils/
│   └── infer.py
│
│── prompts/                  # Prompt templates
│   ├── eval/
│   ├── infer/
│   └── process_range.txt
│
│── results/                  # Inference & evaluation results
│
│── README.md
│── enviroment.yaml
```

✨ With **SciDA**, we aim to push forward **robust and fair evaluation** of LLM reasoning capabilities.
