# CodeAlignBench 

This project provides a framework for creating and applying code transformation instructions. The framework supports multiple programming languages and allows for both rule-based and LLM-based instruction evaluation.

Note: This is the anonymized version of the repository for ICLR review. Some scripts may not run correctly due to anonymization.

- All instructions can be found in `instructions/all_instructions.json`.
- A sample of user strings is available in `instructions/user_strings`.
- Programming tasks could not be included because of the size limit on supplementary materials; they will be available upon request.

## How to Add a New Instruction

Follow these steps to create a new instruction:

1. Create a new Python file named `YourInstructionName.py` containing a class that inherits from [`BaseInstruction`](BaseInstruction.py:24).
2. Implement its abstract methods.
3. Write tests.
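
The steps above can be sketched as follows. This README does not show the real `BaseInstruction` interface, so the stand-in base class below and its method names (`is_applicable`, `verify`) are hypothetical placeholders; check `BaseInstruction.py` for the actual abstract methods before copying anything.

```python
from abc import ABC, abstractmethod


class BaseInstruction(ABC):
    """Stand-in for the repository's BaseInstruction (hypothetical interface)."""

    @abstractmethod
    def is_applicable(self, code: str) -> bool:
        """Return True if the instruction can be applied to this code."""

    @abstractmethod
    def verify(self, code: str) -> bool:
        """Return True if the code follows the instruction."""


class NoPrintStatements(BaseInstruction):
    """Hypothetical rule-based instruction: the code must not call print()."""

    def is_applicable(self, code: str) -> bool:
        # Only worth applying when the original code prints something.
        return "print(" in code

    def verify(self, code: str) -> bool:
        # The rewritten code passes when no print call remains.
        return "print(" not in code


def test_no_print_statements() -> None:
    instruction = NoPrintStatements()
    assert instruction.is_applicable("print('hi')")
    assert not instruction.verify("print('hi')")
    assert instruction.verify("import logging\nlogging.info('hi')")
```

The substring check is deliberately naive; a real rule-based instruction would more likely inspect the parsed code.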


## Environment Setup

Ensure you have the required dependencies and environment variables:

```bash
# Install using pyproject.toml
pip install -e .

# Alternatively, with uv
uv sync
```

## Adding Data
To add the programming questions, create a folder named `questions` and add the LiveBench question file (available from Box if you don't have it) as `{language}_question.jsonl`.

To add existing model generations, create a folder named `model_answers` with a subdirectory named `{language}`, and place the model answers in that subdirectory as `{model_name}.jsonl`.
- For example, for Python we have `model_answers/python/gemini-2.5-pro.jsonl`.
- Model names are abbreviated (the date is removed). See `models.py` for the mappings, and add any that are missing.
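
The layout described above can be checked before running anything. The following is a minimal sketch, not part of the repository: the helper names `expected_paths` and `missing_files` are made up for illustration, and only the folder and file naming comes from this section.

```python
from pathlib import Path


def expected_paths(root: Path, language: str, model_name: str) -> dict[str, Path]:
    """Build the data file locations this section describes (helper is illustrative)."""
    return {
        "questions": root / "questions" / f"{language}_question.jsonl",
        "model_answers": root / "model_answers" / language / f"{model_name}.jsonl",
    }


def missing_files(root: Path, language: str, model_name: str) -> list[str]:
    """Return the names of expected data files that do not exist under root."""
    paths = expected_paths(root, language, model_name)
    return [name for name, path in paths.items() if not path.is_file()]


if __name__ == "__main__":
    # Example: check the current directory for the Python data files.
    for name in missing_files(Path("."), "python", "gemini-2.5-pro"):
        print(f"missing: {name}")
```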


## Running the pipeline
An example command:
```bash
python3 -m pipeline.main --code_model gemini-2.0-flash --judge_model claude-sonnet-4 --lang python --k 1 --m 1 --followup
```
- `--judge_model`: model used for judging applicability and verification. See `models.py` for abbreviated names.
- `--code_model`: model used for generating instruction-following (IF) code snippets. See `models.py` for abbreviated names.
- `--lang`: programming language. One of `python`, `go`, `javascript`, `java`, `swift`.
- `--k`: maximum number of applicable categories to sample per LiveBench question.
- `--m`: maximum number of user strings to sample per category.
- `--followup`: prompt flag. If present, the follow-up prompting context is used; otherwise, the predefined prompting context is used.
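
To make the roles of `--k` and `--m` concrete, here is a small sketch of capped sampling. It only illustrates the "at most k categories, at most m user strings each" behavior described above; the function name `sample_pairs`, the `seed` parameter, and the example categories are assumptions, and the pipeline's actual sampling code may differ.

```python
import random


def sample_pairs(
    applicable: dict[str, list[str]], k: int, m: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Pick at most k categories, then at most m user strings from each."""
    rng = random.Random(seed)
    # Cap k at the number of applicable categories so sampling never fails.
    categories = rng.sample(sorted(applicable), min(k, len(applicable)))
    pairs = []
    for category in categories:
        strings = applicable[category]
        # Cap m at the number of user strings available for this category.
        for user_string in rng.sample(strings, min(m, len(strings))):
            pairs.append((category, user_string))
    return pairs


if __name__ == "__main__":
    # Hypothetical categories and user strings for a single question.
    applicable = {
        "naming": ["use snake_case", "rename temporary variables"],
        "logging": ["replace print with logging"],
    }
    print(sample_pairs(applicable, k=1, m=1))
```

With `--k 1 --m 1`, as in the example command, each question would yield at most one (category, user string) pair under this reading.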

