# LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models

This repository hosts the supplementary material for the LogiNumSynth project. Our work introduces a flexible synthesizer that generates natural language problems requiring joint logical-numerical reasoning. The repository contains scripts for data synthesis, model evaluation, and fine-tuning.

This README provides instructions for:
1. **Data Synthesis**: How to synthesize data using LogiNumSynth.
2. **Model Evaluation**: How to evaluate models on synthesized data.
3. **Model Fine-tuning**: How to fine-tune models on the synthesized data.


## 1. How to synthesize data
Data synthesis with LogiNumSynth involves two sequential steps:

1. **Template-based Synthesis**: Use code in `synthesizer/` to synthesize template-based descriptions along with their formal representations. The `resources/` folder provides pools and templates for synthesis.
2. **Natural Language Conversion**: Use code in `nl-tuning/` to convert the template-based descriptions into more natural language descriptions using a large language model.

### Step 1: Synthesize Template-based Descriptions
To synthesize the same dataset configurations as described in our paper, run:
```bash
cd synthesizer
python main.py           # for EL-EN, EL-HN, HL-EN and HL-HN
python main-train.py     # for EL-Train and HL-Train
python main-exhl-hn.py   # for exHL-HN
```


#### Customizing Synthesis
To customize the synthesis process, refer to the `main*.py` files to modify configurations (detailed in Appendix E.3 of our paper). Here's the minimal code for synthesis:
```python
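from synthesizer.pool import PoolFactory
from synthesizer.template import TemplateFactory
from synthesizer.theory import Theory
# ConstantExpression, IdentityExpression, LinearExpression, and BinaryExpression
# must also be imported; see the `main*.py` scripts for the import path.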

pool_factory = PoolFactory("../resources/pools.json")
template_factory = TemplateFactory("../resources/templates.json")

entities = pool_factory.get_entity_pool(10)
attributes = pool_factory.get_attribute_pool(15)
relations = pool_factory.get_relation_pool(10)

numerical_hard_expression = {
    "normal": {ConstantExpression: 0, IdentityExpression: 0, LinearExpression: 1, BinaryExpression: 1},
    "binary": {ConstantExpression: 1, IdentityExpression: 1, LinearExpression: 1}
}

theory = Theory(template_factory, 
                entities, 
                attributes, 
                relations, 
                fact_num=15, 
                rule_num=15, 
                depth=3,
                condition_num_interval=(1, 3), 
                expression_weights=numerical_hard_expression, 
                interval=(-100, 100))
data = theory.to_json()
data["id"] = "xxx"
```
Instantiating the `Theory` class synthesizes a sample with template-based descriptions. The parameters are:
- **template_factory**: Factory to load templates from `resources/templates.json`
- **entities, attributes, relations**: Sets of entities, attributes, and relations sampled from the pools
- **fact_num, rule_num**: Number of facts and rules to be synthesized
- **depth**: Depth of the reasoning process
- **condition_num_interval**: Range of the number of conditions in each rule
- **expression_weights**: Weights of different types of numerical expressions
- **interval**: Range of operand values
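
To make `expression_weights` more concrete, the sketch below treats each inner dictionary as a set of relative sampling weights, so a weight of 0 disables an expression type and equal weights are equally likely. This is an assumption about how the synthesizer interprets the weights. The string keys are stand-ins for the expression classes, and `sample_expression_type` is a hypothetical helper that is not part of the repository.

```python
import random

# Stand-in for the "normal" entry of numerical_hard_expression; the real
# configuration uses the expression classes as keys rather than strings.
normal_weights = {
    "ConstantExpression": 0,
    "IdentityExpression": 0,
    "LinearExpression": 1,
    "BinaryExpression": 1,
}

def sample_expression_type(weights, rng):
    """Draw one expression type with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample_expression_type(normal_weights, rng) for _ in range(1000)]
counts = {name: draws.count(name) for name in normal_weights}
print(counts)  # zero-weight types are never drawn
```

Under this reading, the `"normal"` entry above only ever yields linear and binary expressions, which is consistent with the "hard numerical" naming.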

To extend the pools and templates, edit `resources/pools.json` and `resources/templates.json`. To extend the synthesizer itself (e.g., with new numerical expressions or logical operators), modify the code in the `synthesizer/` folder.
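
Since the pre-synthesized datasets are stored as `.jsonl` files, a batch of custom samples can be written in the same way, one JSON object per line. The sketch below assumes each sample is the dictionary returned by `theory.to_json()` and that it is JSON-serializable. `write_jsonl` and the `sample-{i}` id scheme are hypothetical and not part of the repository.

```python
import json

def write_jsonl(samples, path, id_prefix="sample"):
    """Assign an id to each sample and write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for i, sample in enumerate(samples):
            record = dict(sample)  # copy so the caller's dict is not mutated
            record["id"] = f"{id_prefix}-{i}"
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

In practice, `samples` would be a list built by repeatedly instantiating `Theory` with the desired configuration and calling `to_json()` on each instance.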


### Step 2: Convert template-based descriptions into more natural language descriptions
To convert the template-based descriptions into more natural language descriptions, please run:
```bash
cd nl-tuning && bash run_llm_tuning.sh
```

### Pre-synthesized Datasets
We have already synthesized several datasets using LogiNumSynth (as described in our paper). These can be found in the `data/` folder, with corresponding few-shot examples in the `prompt/` folder:

**Available Datasets:**
- EL-EN: Easy Logical and Easy Numerical reasoning tasks, stored as `el-en.jsonl`
- EL-HN: Easy Logical and Hard Numerical reasoning tasks, stored as `el-hn.jsonl`
- HL-EN: Hard Logical and Easy Numerical reasoning tasks, stored as `hl-en.jsonl`
- HL-HN: Hard Logical and Hard Numerical reasoning tasks, stored as `hl-hn.jsonl`
- exHL-HN: extremely Hard Logical and Hard Numerical reasoning tasks, split into four subtasks stored as `depth-7.jsonl`, `depth-8.jsonl`, `depth-9.jsonl`, and `depth-10.jsonl`
- EL-Train: Easy Logical but Hard Numerical reasoning tasks for training, stored as `train-el.jsonl`
- EN-Train: Easy Numerical but Hard Logical reasoning tasks for training, stored as `train-en.jsonl`


## 2. Model Evaluation on Synthesized Data
Model evaluation involves two steps:
1. **Model Inference**: Use code in `llm-evaluation/` or `llm-evaluation-api/` to evaluate models on the synthesized data.
2. **Output Scoring**: Use code in `answer-conclude/` to score the model outputs for answer accuracy and process accuracy.

### Step 1: Evaluate models on the synthesized data
If you want to evaluate open-source models deployed locally, please configure `llm-evaluation/run_llm_vllm_loop.sh` and run:
```bash
cd llm-evaluation && bash run_llm_vllm_loop.sh
```

If you want to evaluate models via APIs, please configure `llm-evaluation-api/do_normal_call.py` and run:
```bash
cd llm-evaluation-api && python do_normal_call.py
```
You can either use the API wrapper provided in `llm-evaluation-api/normal_api.py` or implement your own. Instructions and few-shot examples are provided in `prompt/`.
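
If you implement your own API wrapper, each request typically combines the instruction, the few-shot examples, and the target problem into one prompt. The following is a minimal sketch, assuming plain-text instruction and example strings. `build_prompt`, the separator, and the placeholder strings are hypothetical and do not reflect the layout of the files in `prompt/` or the interface of `normal_api.py`.

```python
def build_prompt(instruction, few_shot_examples, problem, separator="\n\n"):
    """Concatenate the instruction, few-shot examples, and target problem."""
    parts = [instruction.strip()]
    parts.extend(example.strip() for example in few_shot_examples)
    parts.append(problem.strip())
    return separator.join(parts)

# Placeholder strings; load the real instruction and examples from prompt/.
prompt = build_prompt(
    "INSTRUCTION TEXT",
    ["FEW-SHOT EXAMPLE 1", "FEW-SHOT EXAMPLE 2"],
    "TARGET PROBLEM",
)
print(prompt)
```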


### Step 2: Score the model outputs in answer accuracy and process accuracy
To score the model outputs for answer accuracy and process accuracy, first call an LLM to convert them into a structured JSON format. To do so, configure `answer-conclude/run_conclude_batch.sh` and run:
```bash
cd answer-conclude && bash run_conclude_batch.sh
```

Then, you can score the model outputs by running:
```bash
cd answer-conclude && bash run_score.sh
```
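
As a rough illustration of the answer-accuracy metric, the sketch below computes the fraction of problems whose structured prediction matches the gold answer. It assumes each record carries hypothetical `predicted_answer` and `gold_answer` fields and compares them numerically when both parse as numbers. The actual output schema and matching rules used by `run_score.sh` may differ, and process accuracy is not covered here.

```python
def answers_match(predicted, gold, tol=1e-6):
    """Compare numerically when both values parse as floats, else as strings."""
    try:
        return abs(float(predicted) - float(gold)) <= tol
    except (TypeError, ValueError):
        return str(predicted).strip().lower() == str(gold).strip().lower()

def answer_accuracy(records):
    """Fraction of records whose predicted answer matches the gold answer."""
    if not records:
        return 0.0
    correct = sum(
        answers_match(r.get("predicted_answer"), r.get("gold_answer"))
        for r in records
    )
    return correct / len(records)

# Hypothetical records: two numeric matches, one non-numeric miss, one missing.
records = [
    {"predicted_answer": "42", "gold_answer": 42},
    {"predicted_answer": "7.0", "gold_answer": 7},
    {"predicted_answer": "unknown", "gold_answer": 3},
    {"predicted_answer": None, "gold_answer": 5},
]
print(answer_accuracy(records))  # 0.5
```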


## 3. Model Fine-tuning on Synthesized Data
Fine-tuning involves two steps:
1. **Supervised Fine-tuning**: Use code in `sft/` to run supervised fine-tuning (optionally with RecAdam) on the synthesized data.
2. **Benchmark Evaluation**: Use code in `sft-eval/` to evaluate the fine-tuned models on external numerical/logical reasoning benchmarks.

### Step 1: Run supervised fine-tuning (optionally with RecAdam) on the synthesized data
If you want to fine-tune a model on the synthesized data, please configure `sft/train_swanlab.sh` and run:
```bash
cd sft && bash train_swanlab.sh
```

You can enable RecAdam (to mitigate catastrophic forgetting) by setting the flag below to `true`; keep it `false` to use the standard optimizer.
```bash
USE_RECALL_ADAM=false  # set to true to enable RecAdam
```
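
For intuition, RecAdam as originally proposed blends the task loss with a quadratic penalty that pulls parameters toward their pretrained values, and shifts the weight toward the task loss along a sigmoid schedule. The sketch below illustrates only that idea in plain Python. The function names and the hyperparameter values `k`, `t0`, and `gamma` are illustrative, and the implementation in `sft/` may differ.

```python
import math

def anneal_lambda(step, k=0.1, t0=250):
    """Sigmoid weight on the task loss: near 0 early, near 1 late."""
    return 1.0 / (1.0 + math.exp(-k * (step - t0)))

def recall_penalty(params, pretrained_params, gamma=1.0):
    """Quadratic penalty pulling parameters toward their pretrained values."""
    return 0.5 * gamma * sum(
        (p - p0) ** 2 for p, p0 in zip(params, pretrained_params)
    )

def blended_loss(task_loss, params, pretrained_params, step):
    """Weighted mix of the task loss and the recall penalty at a given step."""
    lam = anneal_lambda(step)
    return lam * task_loss + (1.0 - lam) * recall_penalty(params, pretrained_params)
```

Early in training the penalty dominates, which keeps the model close to its pretrained weights; later the task loss takes over.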

You can also use SwanLab for experiment tracking and model management. Please edit the following configurations in `sft/train_swanlab.sh`:
```bash
PROJECT_NAME=${2:-"LogiNumSynth"}  # your SwanLab project name
SWANLAB_MODE=${3:-"cloud"}  # cloud, offline, disabled, or local
API_KEY=${4:-""} # your SwanLab API key
```

For advanced training configuration (evaluation/save/generation behavior, batch sizes, etc.), modify `sft/sft.py` directly where `training_args` is constructed. Example:
```python
from transformers import IntervalStrategy  # needed for the strategy values below

# === Configure evaluation/save strategy early (before swanlab.init) ===
training_args.evaluation_strategy = IntervalStrategy.STEPS
training_args.eval_steps = 20
training_args.save_strategy = IntervalStrategy.STEPS
training_args.save_steps = 625
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "accuracy"
training_args.greater_is_better = True
# === Use generation-based evaluation to avoid accumulating logits ===
training_args.predict_with_generate = True
training_args.generation_max_new_tokens = 4096
training_args.generation_num_beams = 1
training_args.generation_do_sample = False
# Keep eval batch small and clear intermediate tensors quickly
training_args.per_device_eval_batch_size = 8
training_args.eval_accumulation_steps = 1
training_args.dataloader_pin_memory = False
```

### Step 2: Evaluate the fine-tuned models on the external numerical/logical reasoning benchmarks
We provide external benchmarks under `sft-eval/datasets/`, grouped into logical and numerical categories. The standard splits are `val.jsonl` and `test.jsonl` (some datasets keep `.json` files where the source format is preserved). Datasets with only a test split (e.g., `mawps`, `aime24`, `rulearena`) include no validation file; `rulearena` ships its test data as per-domain files.

Numerical / mathematical benchmarks (paths):
- `sft-eval/datasets/gsm8k/main/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/math/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/mathqa/val.json`, `test.json`
- `sft-eval/datasets/SVAMP/data/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/mawps/test.jsonl`
- `sft-eval/datasets/aime24/test.jsonl`

Formal deductive logical reasoning benchmarks (paths):
- `sft-eval/datasets/ruletaker/data/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/proofwriter/data/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/folio/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/fld/data/val.jsonl`, `test.jsonl`

Complex logical reasoning and joint logical-numerical reasoning benchmarks (paths):
- `sft-eval/datasets/logiqa/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/reclor/val.json`, `test.json`
- `sft-eval/datasets/abductionr/data/val.jsonl`, `test.jsonl`
- `sft-eval/datasets/rulearena/airline.jsonl`, `nba.jsonl`, `tax.jsonl`
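
Because the benchmark files mix `.jsonl` and `.json`, a loader can dispatch on the file extension. The following is a minimal sketch, assuming `.jsonl` files hold one JSON object per line and `.json` files hold a single JSON document. `load_examples` is a hypothetical helper, not part of `sft-eval/`, and the per-dataset field names are not specified here.

```python
import json
from pathlib import Path

def load_examples(path):
    """Load a benchmark split from a .jsonl or .json file as a list."""
    path = Path(path)
    if path.suffix not in {".jsonl", ".json"}:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    with path.open(encoding="utf-8") as f:
        if path.suffix == ".jsonl":
            # One JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
        data = json.load(f)
    # Wrap a single top-level object so callers always get a list.
    return data if isinstance(data, list) else [data]
```

For example, `load_examples("sft-eval/datasets/gsm8k/main/test.jsonl")` would return the test split as a list of dictionaries under these assumptions.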

If you want to evaluate a fine-tuned model on the external benchmarks, please configure `sft-eval/run_test.sh` and run:
```bash
cd sft-eval && bash run_test.sh
```