![Code Researcher](assets/CResearcher-image.png)

# Code Researcher

## 📁 Repo Structure

The repository is organized as follows:

```
├── config/
├── cresearcher/
├── data/
├── patches/
├── prompts/
├── SWE-agent/
├── Agentless/
```


* **`data/`**
  Contains details about the datasets:

  * The `ffmpeg` dataset is included here and ready to use out of the box. It contains one subdirectory per bug, named `<bug id>`, each containing the files `<bug id>.json`, `reproducer.testcase`, and `build.sh`. The usage of these files is explained below.
  * `kBenchSyz/200_subset.json` contains the subset of 200 `kBenchSyz` bugs for which we were able to reproduce the crashes. The complete `<bug id>.json` files (and some other metadata) can be found in the kBenchSyz dataset (explained below). In the `200_subset.json` file, for each of the 200 bugs, we provide the bug ID (`id`) and the crash report produced (`crash_report_data`) when we reproduced the crash.

* **`prompts/`**
  Contains the textual prompt templates used by Code Researcher for the Analysis and Synthesis phases.

* **`cresearcher/`**
  The source code for running the Code Researcher agent.

* **`config/`**
  YAML configuration files for different datasets and experiment settings (e.g., `ffmpeg`, `kBenchSyz`). Each config file specifies paths, prompts and parameters required to run the agent. 

* **`patches/`**
  Stores the generated patches used for evaluation, divided into categories (e.g., crashing, non-crashing). These are used to compute metrics such as CRR (Crash Resolution Rate), Recall, and the All/Any/None metrics reported in the paper. The files are named in the following format: `setting-tool-max_calls-k.json`. In addition, the `gold_json.json` files contain the ground truth patches for the bugs in the dataset and are used to compute the average recall and All/Any/None metrics.

* **`SWE-agent/config.yaml`**
  Contains the config file we used for running SWE-agent.

* **`Agentless/`**
  Contains the prompts we used for Agentless.

## 🛠️ Setup

### 🔧 Prerequisites

Before setting up the environment, ensure the following tools are installed:

* **[Git](https://git-scm.com/downloads)** – for cloning the repository and managing submodules.
* **[Conda](https://docs.conda.io/en/latest/miniconda.html)** – to manage the Python environment and dependencies.
* **Latest [`universal-ctags`](https://github.com/universal-ctags/ctags)** – required for parsing code structures.

> [!NOTE]
> Please ensure the latest version of `ctags` is installed from the [official ctags repository](https://github.com/universal-ctags/ctags).
> Code Researcher is not compatible with the default `ctags` package available via Ubuntu's package manager.



### Environment Setup

From the root directory of this repository, run:

```bash
conda create -n code-researcher python=3.12
conda activate code-researcher
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)
```
Then export your OpenAI API key:
```shell
export OPENAI_API_KEY={key_here}
```
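The prerequisites above can be sanity-checked before a run. The following is a minimal sketch, not part of the repository: the helper names `is_universal_ctags` and `check_environment` are hypothetical, and the check only detects whether the `ctags` on your `PATH` identifies itself as Universal Ctags. It does not verify that the build is recent enough.

```python
import os
import shutil
import subprocess


def is_universal_ctags(version_output: str) -> bool:
    """Return True if `ctags --version` output identifies Universal Ctags."""
    return "Universal Ctags" in version_output


def check_environment() -> list[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    ctags_path = shutil.which("ctags")
    if ctags_path is None:
        problems.append("ctags not found on PATH")
    else:
        # check=False: other ctags flavors may exit non-zero on --version.
        result = subprocess.run(
            [ctags_path, "--version"], capture_output=True, text=True, check=False
        )
        if not is_universal_ctags(result.stdout):
            problems.append("ctags on PATH is not Universal Ctags")
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

If the script prints nothing, both checks passed.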

## 📊 Running on Benchmark Datasets

### kBenchSyz
To use the kBenchSyz benchmark with Code Researcher, you’ll need to begin by downloading and preparing the dataset from the original source repository: [kGym-Kernel-Playground](https://github.com/Alex-Mathai-98/kGym-Kernel-Playground). 

Follow the dataset setup steps as outlined in their README.
You can use the `data/kBenchSyz/200_subset.json` file to keep only the bug IDs whose crashes we were able to reproduce and, optionally, to replace the crash report present in the dataset with the crash report we generated during reproduction.
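
The filtering step could be sketched as follows. This is an illustration under stated assumptions rather than a script shipped with the repository: it assumes `200_subset.json` is a JSON list of objects with `id` and `crash_report_data` keys, and that the report to replace is the `crash-report-data` field of the first `crashes` entry in each kBenchSyz bug file. The function names are hypothetical.

```python
import copy
import json
from pathlib import Path


def load_subset(subset_path):
    """Map bug ID -> reproduced crash report.

    Assumes a JSON list of objects with `id` and `crash_report_data`
    keys; adjust if the actual layout differs.
    """
    entries = json.loads(Path(subset_path).read_text())
    return {entry["id"]: entry["crash_report_data"] for entry in entries}


def apply_subset(bug, subset):
    """Return an updated copy of `bug`, or None if it is not in the subset."""
    if bug["id"] not in subset:
        return None
    updated = copy.deepcopy(bug)
    # Assumption: the report to replace lives in the first `crashes` entry.
    updated["crashes"][0]["crash-report-data"] = subset[bug["id"]]
    return updated
```

Call `apply_subset` on each loaded `<bug id>.json` and write the non-`None` results to a separate directory, so the original dataset files stay untouched.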

Once the dataset is prepared, **download the Linux backport commits from [linux-backport-commits.json](https://github.com/Alex-Mathai-98/kGym-Kernel-Gym/blob/84f89d9f9df34f08323457b2fc67f053383babb4/assets/linux-backport-commits.json) and set the path to this file in `config/kBenchSyz.yaml` under the `agent > backport_commits_json` key.**
Then you can run Code Researcher with:

```bash
python run_code_researcher.py --config config/kBenchSyz.yaml --target_json <path_to_kgym_kernel_playground_repo>/Kernel_Benchmark/<bug id>.json
```

> [!TIP]
> This creates a JSON file containing the generated patches, along with a log file, in the `<logs dir>/<bug id>/` directory.

#### ✅ Validation
To validate the generated patches, follow the setup instructions for Kernel Gym (kGym) as provided in the [Kernel Gym](https://github.com/Alex-Mathai-98/kGym-Kernel-Gym) repository.

> [!NOTE]
> Validation using kGym is compute-intensive and may take significant time and resources to complete.
> For quicker experimentation, we recommend using the ffmpeg dataset instead — it is easier and faster to run, and we provide it directly within this repository.

### FFmpeg

To run Code Researcher on the bugs in this dataset, you don't require any additional setup. You can simply run:

```bash
python run_code_researcher.py --config config/ffmpeg.yaml --target_json data/ffmpeg/<bug id>/<bug id>.json
```

> [!TIP]
> This creates a JSON file containing the generated patches, along with a log file, in the `<logs dir>/<bug id>/` directory.
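
To run every bug in the dataset rather than one at a time, a small driver along these lines may help. It is a sketch that only wraps the command shown above; the function name `ffmpeg_commands` is hypothetical, and it assumes it is launched from the repository root with the `data/ffmpeg/<bug id>/<bug id>.json` layout described earlier.

```python
import subprocess
import sys
from pathlib import Path


def ffmpeg_commands(data_dir="data/ffmpeg", config="config/ffmpeg.yaml"):
    """Build one run_code_researcher.py command per bug directory."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    commands = []
    for bug_dir in sorted(root.iterdir()):
        target = bug_dir / f"{bug_dir.name}.json"
        # Skip anything that does not follow the <bug id>/<bug id>.json layout.
        if target.is_file():
            commands.append([
                sys.executable, "run_code_researcher.py",
                "--config", config,
                "--target_json", str(target),
            ])
    return commands


if __name__ == "__main__":
    for command in ffmpeg_commands():
        # check=False so that one failing bug does not stop the batch.
        subprocess.run(command, check=False)
```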

To test whether a generated patch resolves the crash, create a local clone of the [`ffmpeg` repository](https://git.ffmpeg.org/ffmpeg.git), check out the appropriate commit (key `parent_of_fix_commit` in `data/ffmpeg/<bug id>/<bug id>.json`), and apply the patch.
Then follow the instructions [recommended by OSS-Fuzz](https://google.github.io/oss-fuzz/advanced-topics/reproducing/) for reproducing from a local checkout.
For convenience, we have included the names of the `sanitizer`, `engine`, and `target` in the bug JSON.
We have also included the reproducer (named `reproducer.testcase`) and a `build.sh` file that builds only the required target (to use it, copy it to `<oss fuzz directory>/projects/ffmpeg/build.sh`).
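
The steps above can be assembled into a command list for review before running anything. This is a sketch under several assumptions: the bug JSON keys are taken to be literally `sanitizer`, `engine`, and `target`, with values that OSS-Fuzz's `infra/helper.py` accepts; the patch is assumed to be saved as a diff file that `git apply` understands; and the function name `reproduction_commands` is hypothetical. Check the `helper.py` invocations against the OSS-Fuzz guide linked above.

```python
import json
from pathlib import Path


def reproduction_commands(bug_json, ffmpeg_dir, oss_fuzz_dir, patch_file):
    """Return the shell commands (as argument lists) for one bug."""
    bug_json = Path(bug_json)
    bug = json.loads(bug_json.read_text())
    helper = str(Path(oss_fuzz_dir) / "infra" / "helper.py")
    build_sh = str(bug_json.parent / "build.sh")
    testcase = str((bug_json.parent / "reproducer.testcase").resolve())
    source = str(Path(ffmpeg_dir).resolve())
    return [
        # Check out the parent of the fix commit and apply the candidate patch.
        ["git", "-C", source, "checkout", bug["parent_of_fix_commit"]],
        ["git", "-C", source, "apply", str(Path(patch_file).resolve())],
        # Use the provided build.sh so that only the required target is built.
        ["cp", build_sh, str(Path(oss_fuzz_dir) / "projects" / "ffmpeg" / "build.sh")],
        # Assumed key names: `sanitizer`, `engine`, `target`.
        ["python", helper, "build_fuzzers", "--sanitizer", bug["sanitizer"],
         "--engine", bug["engine"], "ffmpeg", source],
        ["python", helper, "reproduce", "ffmpeg", bug["target"], testcase],
    ]
```

Printing each command with `" ".join(command)` gives a script you can inspect and run step by step. If the final `reproduce` step no longer crashes, the patch resolves the crash for that reproducer.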

## 🧩 Running on a Custom Codebase

To apply Code Researcher to any other codebase, you’ll need to configure two things:

### 1. Create a Config File

Prepare a YAML config similar to those in the `config/` directory. This should include:

* `prompts > preamble`: Initial instruction or setup context for the model.
* `prompts > analysis_examples`: In-context examples for the Analysis phase.
* `prompts > patch_gen_examples`: In-context examples for patch generation.
* `agent > langs`: Comma-separated list of languages to be supported by Code Researcher. Since Code Researcher relies on `ctags`, the languages should correspond to entries in the list obtained by running `ctags --list-languages`.
* `agent > repo_url`: URL of the Git repository to analyze.
* `agent > repo_name`: A short, descriptive identifier for the repository. Used for organizing logs.
* `agent > patch_gen_llm`: Identifier for the language model used for patch generation in the Synthesis phase (we support `o1` and `gpt-4o`).
* `agent > num_patches`: Number of distinct patch candidates to generate per bug.
* `agent > max_analysis_steps`: Upper bound on the number of analysis steps in the analysis phase.
* `agent > logs_dir`: Directory path where logs should be stored.
* `agent > work_dir`: Directory path where cloned repositories and temporary files will be kept during execution.

> [!NOTE]
> Ensure that the structure and format of the examples exactly match those used in existing configs — Code Researcher relies on consistent formatting for prompt parsing to function correctly.
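
Before launching a run, it can be useful to confirm that a new config defines every key listed above. The sketch below is not part of the repository. It assumes that `a > b` denotes key `b` nested under a top-level `a` mapping in the YAML file, and it operates on an already-parsed dictionary (for example, the result of `yaml.safe_load`, assuming PyYAML is available in your environment). The names `REQUIRED_KEYS` and `missing_config_keys` are hypothetical.

```python
# Keys taken from the list above; the nesting is an assumption.
REQUIRED_KEYS = {
    "prompts": ["preamble", "analysis_examples", "patch_gen_examples"],
    "agent": [
        "langs", "repo_url", "repo_name", "patch_gen_llm",
        "num_patches", "max_analysis_steps", "logs_dir", "work_dir",
    ],
}


def missing_config_keys(config):
    """Return the `section > key` entries absent from a parsed config dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        missing.extend(f"{section} > {key}" for key in keys if key not in values)
    return missing
```

An empty return value means all listed keys are present. It does not check that the values are valid or that the in-context examples are formatted correctly.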

### 2. Create a `target_json`

Prepare a bug JSON file containing details about the crash -- similar to the `<bug id>.json` files in `data/ffmpeg/`. This serves as the input bug description for the agent to address. The following fields are necessary:

- `id (str)`: An id for the bug. This is used to create an appropriate folder within the `<logs dir>`.
- `title (str)`: This title is presented to the LLM before presenting the full crash report.
- `parent_of_fix_commit (str)`: Code Researcher checks the repository out to this commit before starting its run.
- `crashes (array)`: This must be non-empty. Each element should be a dictionary with keys
  - `kernel-source-commit (str)`: The commit at which the crash was found. This can be the same as `parent_of_fix_commit` (as in our FFmpeg dataset) or different (as in kBenchSyz). Using a different commit is useful when many commits separate the crash commit from the fix commit (if known) and you want to qualitatively compare the fix generated by Code Researcher with the actual fix.
  - `crash-report-data (str)`: These are the contents of the crash report that are presented to the LLM.
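
A file with these fields can be produced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and any values you pass (IDs, titles, commit hashes, reports) are your own. Only the field names come from the list above.

```python
import json
from pathlib import Path


def make_target_json(bug_id, title, parent_of_fix_commit, crash_commit, crash_report):
    """Build a bug description containing the required fields."""
    return {
        "id": bug_id,
        "title": title,
        "parent_of_fix_commit": parent_of_fix_commit,
        # `crashes` must be non-empty; a single entry is enough.
        "crashes": [
            {
                "kernel-source-commit": crash_commit,
                "crash-report-data": crash_report,
            }
        ],
    }


def write_target_json(bug, out_dir):
    """Write the bug to <out_dir>/<id>.json and return the path."""
    path = Path(out_dir) / f"{bug['id']}.json"
    path.write_text(json.dumps(bug, indent=2))
    return path
```

If the crash was found at the same commit you want the agent to start from, pass the same hash for `parent_of_fix_commit` and `crash_commit`.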

---

Once both are prepared, run Code Researcher with your custom config and `target_json`:

```bash
python run_code_researcher.py --config config/your_config.yaml --target_json path/to/your_target.json
```

## 📈 Results
Run each command in your terminal to generate the respective table:

```bash
# Generates Table 1 (Crash Resolution Rates of different tools) from the main paper
python create_tables.py --table 1

# Generates Table 2 (Average Recall and All/Any/None numbers for different tools) from the main paper
python create_tables.py --table 2

# Generates Table 3 (CRR, Average Recall and All/Any/None numbers for the search_commits ablation) from the main paper
python create_tables.py --table 3

# Generates the Context Filtering Ablation table (Section 5.4)
python create_tables.py --table 4

# Generates results specifically for the FFmpeg dataset, corresponding to RQ6 in the main paper
python create_tables.py --table ffmpeg
```
