This temporary README helps reviewers navigate our codebase. Due to space constraints, the results folders (hundreds of MB) are not included in this repository. Please refer to the main paper for graphs and tables on faithfulness, accuracy, and p-values.

### Pipeline scripts

We provide pipelines to evaluate the accuracy and faithfulness of chain-of-thought (CoT) reasoning in large language models (LLMs). These include:

- `openai_faithfulness_pipeline.py` uses the OpenAI API and is configured by `openai_faithfulness_config.json`
- `llama_faithfulness_pipeline.py` runs on GPU clusters and is configured by `llama_faithfulness_config.json`
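
Since each pipeline reads its settings from a JSON file, a minimal sketch of loading such a config is shown below. The helper names `load_config` and `merge_overrides` and any keys are hypothetical and used only for illustration; the actual schema is defined by the config files in this repository.

```python
import json
from pathlib import Path


def load_config(path):
    """Read a JSON config file and return its contents as a dict."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def merge_overrides(config, overrides):
    """Return a shallow copy of `config` with `overrides` applied on top.

    The original dict is left unchanged, so one base config can be
    reused across several runs.
    """
    merged = dict(config)
    merged.update(overrides)
    return merged
```

For example, `load_config("openai_faithfulness_config.json")` returns the settings as a dictionary, and `merge_overrides` can adjust individual values for a single run without editing the file.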

The remaining scripts generate and select in-context learning (ICL) and finetuning examples, and evaluate the faithfulness of LLMs after the respective ICL or finetuning pipeline has been run. Additionally, the intervention can be run via `iti-faithfulness.ipynb`.

### Utils

The `utils` folder is structured as follows:
- `utils/data.py` contains the data loaders and processors for the datasets
- `utils/faithfulness.py` contains faithfulness metric functions
- `utils/llama.py` contains functions for interacting with LLAMA models via the HuggingFace API or GPU clusters
- `utils/openaiapi.py` contains functions for interacting with the OpenAI API (finetuning, prompting, log probabilities, cost computation, etc.)
- `utils/parsers.py` contains functions for parsing responses from LLMs
- `utils/results.py` contains functions that generate tables, graphs, and dictionaries of results
- `utils/selection.py` contains functions for selecting and constructing examples for finetuning/ICL
- `utils/util.py` contains helper functions for generating prompts, constructing save directories, etc.
- `utils/visualize.py` contains functions for visualizing results, including plots
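
To illustrate the kind of response parsing that `utils/parsers.py` is responsible for, here is a minimal sketch. It assumes a multiple-choice task whose responses end with a phrase such as "the answer is (B)"; that answer format, the pattern, and the function name `extract_choice` are assumptions made for illustration, not the repository's actual parser.

```python
import re

# Hypothetical answer format: "... the answer is (B)." with choices A-E.
# The trailing \b prevents matching ordinary words such as "answer is clear".
ANSWER_PATTERN = re.compile(r"answer is \(?([A-E])\b", re.IGNORECASE)


def extract_choice(response):
    """Return the last multiple-choice letter stated in `response`, or None.

    Taking the last match means a self-correction inside the reasoning
    ("the answer is A ... wait, the answer is C") resolves to the final answer.
    """
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None
```

Returning `None` when no answer is found lets downstream code count unparseable responses separately instead of treating them as wrong answers.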