# CUMath: A Benchmark and Evaluation Framework for LLMs on Mathematical Reasoning in Undergraduate Computational Math

This code accompanies the paper:

> **CUMATH: A Benchmark and Evaluation Framework for LLMs on Mathematical Reasoning in Undergraduate Computational Math**
> *Anonymous authors — under double-blind review*

CUMath contains **2,100** real course problems (Calculus, Linear Algebra, Differential Equations, etc.) with step-by-step solutions and a **two-part evaluation framework**:

1. **Automatic metrics:** final-answer accuracy (string and symbolic) and three step-level metrics: exact/semantic Step-F1, Stepwise Reasoning Score (SRS), and Validity–Redundancy (VR).
2. **LLM-as-a-grader:** an LLM scores each step from 1 to 5 and adds a brief comment; symbolic checks and an external CAS are used internally.


## Folder layout

```text
submission_code/
├─ analyze/
│  └─ analyze_stat_data.py     # Counts by topic/subtopic (appendix stats)
├─ evaluation/
│  ├─ automatic_metrics.py     # Accuracy / F1 / SRS / VR
│  └─ llm_as_a_grader.py       # Per-step scores + feedback
├─ evaluation_report/
│  ├─ llm_feedback.json        # Example graded output
│  └─ metrics_report.json      # Example automatic metrics output
├─ generate_answer/
│  ├─ closed_sourced.py
│  ├─ open_sourced.py
│  └─ specialized_math.py
├─ output/
│  └─ example.json             # Minimal example input
└─ README.md
```

> **Note:** Full evaluation outputs **are not included** in `submission_code/` to keep the zip small and anonymous. The path constants at the top of each script can be edited to point at your own data.

---

## Quick start (tiny demo)

A minimal **`output/example.json`** is included to run the pipeline end-to-end.

1. **Automatic metrics**

```bash
python submission_code/evaluation/automatic_metrics.py
```

The report is written to `submission_code/evaluation_report/metrics_report.json`.

2. **LLM-as-a-grader (per-step feedback)**

```bash
# If using OpenAI, export your API key so the script can read it
export OPENAI_API_KEY="sk-..."
python submission_code/evaluation/llm_as_a_grader.py
```

The feedback is written to `submission_code/evaluation_report/llm_feedback.json`.

3. **Dataset analyzer (counts by topic/subtopic)**

```bash
python submission_code/analyze/analyze_stat_data.py
```

Prints per-topic subtopic counts and overall totals to stdout.

---

## Regenerating full results

1. **Generate model answers (optional)**

```bash
python submission_code/generate_answer/closed_sourced.py
python submission_code/generate_answer/open_sourced.py
python submission_code/generate_answer/specialized_math.py
```

2. **Set the input and output paths** near the top of:

* `evaluation/automatic_metrics.py`
* `evaluation/llm_as_a_grader.py`

3. **Run metrics and grader** as in Quick start. Outputs will be written to the configured report folders (e.g., `evaluation_report/` and `summary/`).

---

## Input format (one example)

```json
{
  "id": "76",
  "topic": "Single Variable Calculus",
  "subtopic": "definite integral",
  "question": "Evaluate the definite integral: ...",
  "answer": "exact form: ln((π + 3√3) / (π - 3√3)) or decimal form: 1.40073120",
  "steps": ["step 1...", "step 2...", "step 3..."],
  "source": "Real-world Assessment",
  "type": "FR",
  "model": "gpt-4.1",
  "provider": "openai",
  "strategy": "zero_shot",
  "temperature": 0.0,
  "model_answer": "Model's full solution text…"
}
```
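
Before running the evaluators on your own data, it can help to check that each record has the fields shown above. The following is a minimal sketch, not part of the released scripts: `load_records`, `missing_fields`, and `REQUIRED_FIELDS` are hypothetical names, and treating these particular fields as required (and accepting either a single object or a list per file) is an assumption.

```python
import json
from pathlib import Path

# Field names copied from the example record above. Treating them as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = {
    "id", "topic", "subtopic", "question", "answer", "steps", "model_answer",
}

def load_records(path):
    """Return a list of records from a JSON file holding one object or a list."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data if isinstance(data, list) else [data]

def missing_fields(record):
    """Return the sorted names of required fields absent from a record."""
    return sorted(REQUIRED_FIELDS - set(record))
```

Calling `missing_fields(rec)` on each record from `load_records("output/example.json")` returns an empty list when nothing is missing.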


## What each evaluator does

### `automatic_metrics.py`

* Splits the model solution into steps for evaluation. If `Step k:` headers appear anywhere in the text, the text is sliced between them; otherwise each non-empty line is treated as a step. The `Step k:` header is removed from each returned chunk.
* Computes: string/symbolic accuracy, Step-F1 (exact & semantic), SRS (faithfulness, informativeness-step/chain, coherence, discourse, repetition), and VR.
* Paths are set at the top of the script:

  ```python
  BASED_DIR = Path(r".../submission_code")
  IN_PATH   = BASED_DIR / "output" / "example.json"
  OUT_DIR   = BASED_DIR / "evaluation_report"
  ```
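
The step-splitting rule described above can be sketched as follows. This is an illustration of the rule, not the code in `automatic_metrics.py`: the function name `split_steps`, the exact regular expression, case-insensitive matching, and dropping any text before the first header are all assumptions.

```python
import re

# Matches headers such as "Step 1:" or "step 12:" anywhere in the text.
# Case-insensitive matching is an assumption of this sketch.
STEP_HEADER = re.compile(r"Step\s+\d+\s*:", re.IGNORECASE)

def split_steps(text):
    """Split a solution into steps, with the 'Step k:' headers removed."""
    headers = list(STEP_HEADER.finditer(text))
    if headers:
        chunks = []
        for i, match in enumerate(headers):
            # Each chunk runs from the end of its header to the next header.
            end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
            chunks.append(text[match.end():end].strip())
        # Text before the first header is dropped (assumption).
        return [c for c in chunks if c]
    # No headers found: fall back to non-empty lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, `split_steps("Step 1: Let u = x^2.\nStep 2: Then du = 2x dx.")` yields two chunks with the headers stripped, while a solution without headers is split line by line.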

### `llm_as_a_grader.py`

* Splits the model solution into steps and asks an LLM to grade each step (1–5) with a brief comment. If `Step k:` headers appear anywhere in the text, the text is sliced between them; otherwise each non-empty line is treated as a step. The `Step k:` header is removed from each returned chunk.
* In this release, the results of the symbolic/CAS checks are not written to the final JSON; only the LLM feedback is stored. Our full version also retains the symbolic outputs, so we can verify where the math is checked during experiments.
* Reads the API key from the environment and uses `temperature=0` for stability and reproducibility.
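
If you adapt the grader to another provider, a small helper that fails early on a missing key can save a wasted run. The sketch below is hypothetical: `read_api_key` is not a function in `llm_as_a_grader.py`, and the default variable name is taken from the Quick start command.

```python
import os

def read_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key from an environment variable, or raise if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {var_name} in the environment before running the grader."
        )
    return key
```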

### `analyze_stat_data.py`

* Scans a dataset folder of JSON files; prints per-topic subtopic counts and overall totals.
* Configure:

  ```python
  DATASET_DIR = Path(".../dataset")
  ```
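
The counting described above can be approximated with the standard library. This is a sketch under assumptions, not the analyzer itself: the function names are hypothetical, each file is assumed to hold either one record or a list of records, and records missing a `topic` or `subtopic` are counted under `"unknown"`.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path

def load_dataset(dataset_dir):
    """Read every *.json file in a folder into one flat list of records."""
    records = []
    for path in sorted(Path(dataset_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        records.extend(data if isinstance(data, list) else [data])
    return records

def count_subtopics(records):
    """Map each topic to a Counter of its subtopics."""
    counts = defaultdict(Counter)
    for rec in records:
        topic = rec.get("topic", "unknown")
        counts[topic][rec.get("subtopic", "unknown")] += 1
    return counts

def print_report(counts):
    """Print per-topic subtopic counts followed by the overall total."""
    total = 0
    for topic in sorted(counts):
        topic_total = sum(counts[topic].values())
        total += topic_total
        print(f"{topic}: {topic_total}")
        for subtopic, n in counts[topic].most_common():
            print(f"  {subtopic}: {n}")
    print(f"Total: {total}")
```

A typical call would be `print_report(count_subtopics(load_dataset(DATASET_DIR)))`.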


## Requirements

**Python**
- Python ≥ 3.9

**Core (used by automatic metrics + analyzer)**
- `torch`
- `transformers`
- `sympy`
- `numpy`
- `tqdm`
- `pandas`
- `matplotlib`

**Closed-source generators (OpenAI / Anthropic)**
- `openai`
- `anthropic`
- `tenacity`
- `requests`

**Open-source & specialized-math generators (HF Inference API)**
- `huggingface_hub`
- `tenacity`
- `requests`

### Install

Minimal for metrics/analyzer:
```bash
pip install torch transformers sympy numpy pandas matplotlib tqdm
```

Closed-source generation:

```bash
pip install openai anthropic tenacity requests
```

Open-source / specialized-math generation:

```bash
pip install huggingface_hub tenacity requests
```


## Run on Colab

Mount Drive and set paths:

```python
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
```
Then edit the path constants in each script so they start with `/content/drive/MyDrive/...`.



## Reproducibility

* **Deterministic metrics:** accuracy/F1/SRS/VR are deterministic given the same inputs.
* **LLM grading:** `temperature=0` (already set); minor variation may remain across providers/models.
* **Symbolic checks:** answers are normalized from LaTeX to SymPy before comparison; both string and symbolic equivalence are considered.



## Ethics & privacy

Problems from “Real-world Assessment” sources (quizzes, exams, homework) are normalized to remove identifying details.



## Citation

To be added upon publication.