### Script: `fetch_and_generate_prompts.py`

This script fetches integer sequences from the OEIS (On-Line Encyclopedia of Integer Sequences) and generates code-writing prompts that test whether language models can write programs producing these sequences. The prompts are stored in separate directories for easy and hard sequences.

#### Functionality:

- **Sequence Fetching**: 
  The script downloads OEIS sequences using their unique sequence ID and caches them locally for reuse.
  
- **Prompt Generation**: 
  For each sequence, a prompt is generated instructing the language model to write code that outputs the nth element of the sequence.

- **Easy vs. Hard Classification**: 
  Sequences are categorized into "easy" and "hard" based on OEIS keywords. The script collects up to 250 easy and 250 hard sequences.
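
As a rough illustration of the classification and the 250-per-category cap, the sketch below assumes each record is a `(sequence_id, keywords)` pair whose keywords are a comma-separated string containing the OEIS `easy` or `hard` keyword. The names `classify` and `collect` are hypothetical and are not taken from the script.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def classify(keywords: str) -> Optional[str]:
    """Return 'easy', 'hard', or None for a comma-separated keyword string."""
    tags = {k.strip() for k in keywords.split(",")}
    if "easy" in tags:
        return "easy"
    if "hard" in tags:
        return "hard"
    return None  # sequences with neither keyword are skipped


def collect(records: Iterable[Tuple[str, str]], limit: int = 250) -> Dict[str, List[str]]:
    """Gather up to `limit` sequence IDs per category, then stop."""
    buckets: Dict[str, List[str]] = {"easy": [], "hard": []}
    for seq_id, keywords in records:
        label = classify(keywords)
        if label is not None and len(buckets[label]) < limit:
            buckets[label].append(seq_id)
        # Stop as soon as both categories are full.
        if all(len(ids) >= limit for ids in buckets.values()):
            break
    return buckets
```

In this sketch a sequence tagged with both keywords would count as easy; the real script may resolve that case differently.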

#### Output:

- Prompts for easy sequences are saved in the `SequenceEasyPrompts` directory.
- Prompts for hard sequences are saved in the `SequenceHardPrompts` directory.

#### How to Run:

1. Ensure you have internet access to fetch sequences from OEIS.
2. Run the script:
   ```bash
   python fetch_and_generate_prompts.py
   ```
3. The script will stop automatically once it collects 250 easy and 250 hard sequences.

---

### Script: `generate_responses.py`

This script generates responses to prompts using different language models via the OpenAI API. It reads prompts from files, submits them to OpenAI's models, and saves the responses in corresponding directories for later evaluation.

#### Functionality:

- **Prompt-Response Workflow**: 
  The script reads prompts from the `SequenceEasyPrompts` and `SequenceHardPrompts` directories, sends them to OpenAI's models, and saves the responses in respective directories (e.g., `SequenceEasyResponses_gpt-4o/`).
  
- **Model Variety**: 
  The script uses three different models for generating responses: `gpt-4o`, `gpt-4o-mini`, and `o1-mini`.

- **Response Caching**: 
  If a response to a prompt has already been generated, the script loads the cached response to avoid redundant API calls.
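
The caching behaviour can be sketched as follows. This is a minimal illustration, not the script's code: `cached_response` is a hypothetical helper, the `{"response": ...}` JSON layout is an assumption, and `generate` stands in for whatever function actually calls the OpenAI API.

```python
import json
from pathlib import Path
from typing import Callable


def cached_response(prompt_path: Path, out_dir: Path,
                    generate: Callable[[str], str]) -> str:
    """Return the saved response for a prompt, calling `generate` only on a cache miss."""
    out_path = out_dir / (prompt_path.stem + ".json")
    if out_path.exists():
        # Cache hit: reuse the stored response, no API call.
        return json.loads(out_path.read_text(encoding="utf-8"))["response"]
    # Cache miss: generate once and persist the result.
    response = generate(prompt_path.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps({"response": response}), encoding="utf-8")
    return response
```

Passing the model call in as `generate` keeps the caching logic independent of any particular client library, so the same wrapper could serve each of the three models with a different output directory.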

#### Output:

- Responses are saved as `.json` files in directories like `SequenceEasyResponses_gpt-4o` and `SequenceHardResponses_o1-mini`, corresponding to the model used.
  
#### How to Run:

1. Set your OpenAI API key in the `OPENAI_API_KEY` environment variable:
   ```bash
   export OPENAI_API_KEY="your-api-key-here"
   ```

2. Run the script:
   ```bash
   python generate_responses.py
   ```

3. The script will read the prompts, generate responses using the specified models, and save the results.

---

### Script: `extract_codes.py`

This script extracts Python code from model-generated responses and saves the extracted code into corresponding `.py` files. It processes response files from different models and categories (easy/hard) and outputs the extracted code into structured directories.

#### Functionality:

- **Code Extraction**: 
  The script reads the responses (in `.json` format), extracts the last Python code block found inside triple backticks (```), and saves the extracted code into `.py` files.

- **Directory Mappings**: 
  The script processes responses from multiple models and categories, saving the extracted code to corresponding directories (e.g., responses from `SequenceEasyResponses_gpt-4o` will have their code saved in `SequenceEasyCodes_gpt-4o`).
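
One way to pick out the last Python code block is a non-greedy regular expression, sketched below. It assumes the response text has already been read from the `.json` file and that code fences carry an explicit `python` tag; the helper name `extract_last_code_block` is hypothetical, and the script itself may match fences differently.

```python
import re
from typing import Optional

# Matches a fence tagged "python" and captures everything up to the next fence.
PYTHON_FENCE = re.compile(r"```python[ \t]*\n(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_last_code_block(text: str) -> Optional[str]:
    """Return the body of the last python-tagged fenced block, or None if there is none."""
    blocks = PYTHON_FENCE.findall(text)
    return blocks[-1].strip() if blocks else None
```

Taking the last block rather than the first reflects the idea that a model often revises its answer, so the final block is the most likely to be the intended solution.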

#### Output:

- Python files with the extracted code are saved in directories like `SequenceEasyCodes_gpt-4o` and `SequenceHardCodes_o1-mini`, with filenames matching the original response file names.

#### How to Run:

1. Ensure you have already generated responses with `generate_responses.py`.
2. Run the script:
   ```bash
   python extract_codes.py
   ```
3. The script will process each response, extract the Python code, and save it in the appropriate directory.

---

### Script: `analyze_cheating.py`

This script evaluates Python code files generated by language models and determines if the code uses a look-up table, which is considered "cheating" for the benchmark. It uses the OpenAI API to analyze each code file and outputs either a `1` (if cheating is detected) or a `0` (if no cheating is detected).

#### Functionality:

- **Look-up Table Detection**: 
  The script sends the content of each `.py` file to the OpenAI model with a prompt designed to detect whether the code improperly uses a look-up table.

- **Structured Output**: 
  The model is prompted to return a structured output of either `1` (cheating detected) or `0` (no cheating).

- **Retry Mechanism**: 
  In case of failures during the API request, the script retries up to 5 times with a delay between attempts.
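
The retry loop and the `1`/`0` parsing might look roughly like this. Both helpers are hypothetical, the 2-second delay is an assumed value (the source only says there is a delay), and `call` stands in for the actual API request.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_verdict(reply: str) -> int:
    """Convert the model's reply to 1 (cheating) or 0 (no cheating)."""
    text = reply.strip()
    if text not in {"0", "1"}:
        raise ValueError(f"unexpected verdict: {reply!r}")
    return int(text)


def with_retries(call: Callable[[], T], attempts: int = 5, delay: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on any exception up to `attempts` times in total."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:  # broad on purpose: network and parsing errors alike
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Wrapping both the request and `parse_verdict` in the same `call` means a malformed reply is retried just like a network failure.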

#### Output:

- For each code file, the result (`1` or `0`) is saved in a file with the same base name as the code file and a `.cheated` extension.

#### How to Run:

1. Set your OpenAI API key in the `OPENAI_API_KEY` environment variable:
   ```bash
   export OPENAI_API_KEY="your-api-key-here"
   ```

2. Ensure that the code files to be analyzed are in the appropriate directories (e.g., `SequenceEasyCodes_gpt-4o`, `SequenceHardCodes_o1-mini`).

3. Run the script:
   ```bash
   python analyze_cheating.py
   ```

4. The script will analyze each code file and save the result (`1` or `0`) in the corresponding `.cheated` file.

---

### Script: `evaluate_sequences.py`

This script evaluates Python code generated by models against integer sequences to measure the correctness of their outputs. It runs the code with varying timeouts and records the accuracy of the models for each sequence.

#### Functionality:

- **Sequence Evaluation**: 
  The script reads sequences from a `stripped` file and evaluates the Python code files corresponding to those sequences. Each code file is run under several timeouts, and its output is compared with the expected sequence values to compute an accuracy.
  
- **Cheating Detection**: 
  If a code file has been flagged for using a look-up table (cheating), the script automatically assigns it a score of 0.

- **Offset Handling**: 
  Sequences may have an offset value, which is loaded from the OEIS cache, ensuring the correct starting index when evaluating the code.

- **Multiple Timeouts**: 
  The script tests each code file with different timeouts (0.5, 1, 2, and 4 seconds) to measure performance under time constraints.
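
The sketch below shows how offset handling, timeouts, and the cheating rule could combine into one score. It rests on a loud assumption: that each generated program reads `n` from standard input and prints the nth term. The real calling convention is not stated here and may differ. The helper names `run_term` and `score` are hypothetical.

```python
import subprocess
import sys
from typing import Optional, Sequence

TIMEOUTS = (0.5, 1, 2, 4)  # seconds, as listed above


def run_term(code_path: str, n: int, timeout: float) -> Optional[str]:
    """Run the program for index n; return its stdout, or None on timeout or error."""
    try:
        result = subprocess.run(
            [sys.executable, code_path],
            input=f"{n}\n",  # assumed convention: n is passed on stdin
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def score(code_path: str, terms: Sequence[int], offset: int,
          timeout: float, cheated: bool = False) -> float:
    """Percentage of terms reproduced; flagged code always scores 0."""
    if cheated or not terms:
        return 0.0
    # terms[i] is the value at index offset + i.
    correct = sum(
        run_term(code_path, offset + i, timeout) == str(expected)
        for i, expected in enumerate(terms)
    )
    return 100.0 * correct / len(terms)
```

Under these assumptions, a full evaluation would call `score` once per value in `TIMEOUTS` and write each result to the matching `.score` file.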

#### Output:

- Scores are saved in `.score` files in directories like `SequenceEasyScores_gpt-4o` and `SequenceHardScores_o1-mini`. Each score file contains the accuracy as a percentage for the corresponding sequence and timeout.

#### How to Run:

1. Ensure you have sequences in the `stripped` file and Python code files in the appropriate model directories (e.g., `SequenceEasyCodes_gpt-4o`).
2. Run the script:
   ```bash
   python evaluate_sequences.py
   ```
3. The script will evaluate all sequences, generate scores, and save them in the corresponding score directories.

#### Debugging:

- To enable debugging and see detailed information on mismatches, set the `debug` variable to `True` in the `main` function.

---

### Script: `generate_latex_table.py`

This script calculates the average scores and cheating percentages for each model and timeout, and formats the results into a LaTeX table. The script processes scores from both the *easy* and *hard* benchmarks, showing a detailed comparison across models and timeouts.

#### Functionality:

- **Average Score Calculation**: 
  The script computes the average score for each model at different timeout values (0.5, 1, 2, and 4 seconds) based on the `.score` files in the respective directories.
  
- **Cheating Percentage**: 
  It also calculates the percentage of sequences flagged for cheating (via `.cheated` files) for each model and timeout.

- **LaTeX Table Output**: 
  The script outputs the results in LaTeX table format, showing the average score and cheating percentage for each model and timeout across both *easy* and *hard* benchmarks.
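
The aggregation and row formatting can be sketched as below. This assumes the scores and cheating flags for one model and timeout have already been loaded into lists; how the script parses `.score` and `.cheated` files is not shown. The names `summarize` and `latex_row` are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def summarize(scores: Sequence[float], cheated: Sequence[int]) -> Tuple[float, float]:
    """Return (average score, percentage of sequences flagged as cheating)."""
    avg = mean(scores) if scores else 0.0
    pct_cheating = 100.0 * sum(cheated) / len(cheated) if cheated else 0.0
    return avg, pct_cheating


def latex_row(model: str, timeout: float,
              easy: Tuple[float, float], hard: Tuple[float, float]) -> str:
    """Format one table row: model, timeout, then (avg, % cheating) for easy and hard."""
    cells = [f"{value:.1f}" for pair in (easy, hard) for value in pair]
    return f"{model} & {timeout} & " + " & ".join(cells) + r" \\"
```

Each row produced this way lines up with the six-column layout of the table: model, timeout, and an average score and cheating percentage for each benchmark.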

#### Output:

- The results are printed as a LaTeX table, with columns for *SequenceEasy* and *SequenceHard*, displaying:
  - **Average score**: The percentage of correct outputs.
  - **Cheating percentage**: The percentage of sequences flagged for cheating.

#### How to Run:

1. Ensure that `.score` files are present in the relevant directories (e.g., `SequenceEasyScores_gpt-4o`, `SequenceHardScores_o1-mini`), along with the `.cheated` files used for the cheating percentages.
2. Run the script:
   ```bash
   python generate_latex_table.py
   ```
3. The script will output a LaTeX table summarizing the average scores and cheating percentages for each model and timeout.

#### Example LaTeX Table:

```latex
\begin{table}[h!]
\centering
\begin{tabular}{|l|c|c c|c c|}
\hline
Model & Timeout & \multicolumn{2}{c|}{SequenceEasy} & \multicolumn{2}{c|}{SequenceHard} \\
 & & Avg. Score & \% Cheating & Avg. Score & \% Cheating \\
\hline
... (table rows go here) ...
\hline
\end{tabular}
\caption{Evaluation of Average Scores and Cheating Percentages by Timeout}
\end{table}
```
