# Exposing the Achilles’ Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Repository for running the code and evaluating models for the paper "Exposing the Achilles’ Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning".
## Results
The MWP-Incorrect dataset and the corresponding output of each model can be found in the `dataset` folder. The results are stored in the following layout:
```
dataset
└── GSM8K
    ├── dataFiles
    │   ├── COT
    │   ├── defaultCOT
    │   └── smallerModel
    └── results
        ├── claude
        ├── gpt-35-turbo
        ├── GPT4
        ├── GPT4O
        ├── llama-2-7b-chat
        ├── mistral
        └── phi
```

Each dataset folder contains two folders, `dataFiles` and `results`. The `dataFiles` folder contains the MWP-Incorrect dataset, segregated into SLM incorrect reasoning steps and default incorrect reasoning steps. The `results` folder contains the evaluation results of each model on the MWP-Incorrect dataset.
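To get a quick overview of which model outputs are present, the layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `list_model_results` is hypothetical, and it assumes only the `dataset/<dataset name>/results/<model>` structure shown in the tree, making no assumption about the file names or formats inside each model folder.

```python
from pathlib import Path


def list_model_results(dataset_root, dataset_name="GSM8K"):
    """Map each model folder under <dataset_root>/<dataset_name>/results
    to the sorted list of files found inside it.

    Hypothetical helper: it relies only on the folder layout described
    in this README, not on any specific file name or format.
    """
    results_dir = Path(dataset_root) / dataset_name / "results"
    if not results_dir.is_dir():
        # Nothing to report if the expected results folder is missing.
        return {}

    summary = {}
    for model_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
        # Collect every file below the model folder, relative to that folder.
        summary[model_dir.name] = sorted(
            str(f.relative_to(model_dir))
            for f in model_dir.rglob("*")
            if f.is_file()
        )
    return summary


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for model, files in list_model_results("dataset").items():
        print(f"{model}: {len(files)} file(s)")
```

Run from the repository root, this prints one line per model folder with the number of files it contains.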

## Running the code
### COT Creation
This folder is used to create the COT when reasoning steps are not provided.

### GPT Evaluation
This folder is used to evaluate the models on the T1 task on the MWP-Incorrect dataset. To run the code, run the `runGPT.sh` script.

### GPTFinal Answer
This folder is used to evaluate the models on the T2 task on the MWP-Incorrect dataset. To run the code, run the `runGPT.sh` script.

### GPTMemorization
This folder is used to run the memorization tasks on the MWP-Incorrect dataset. To run the code, run the `runMemorization.sh` script.

### RuleBasedIncorrect
This folder is used to run the rule-based incorrect reasoning steps on the MWP-Incorrect dataset.
