# SecVulEval
SecVulEval is a dataset of C/C++ vulnerabilities. The dataset includes 5,867 CVEs, 10,998 vulnerable and 14,442 non-vulnerable functions. Besides other relevant metadata, this dataset includes bug-fix changes at the statement level and contextual information related to the vulnerable statement. The dataset is available at HuggingFace and can be loaded using the following script.
```
from datasets import load_dataset

dataset = load_dataset("arag0rn/SecVulEval", split="train")
```

## Run Vulnerability Detection
To run the vulnerability detection experiments, first install the following packages.
```
torch==2.5.1
transformers==4.47.0
accelerate==1.3.0   # for automatic GPU distribution if using multi-GPU
openai
anthropic
tree-sitter
tree-sitter-c==0.23.4
tree-sitter-cpp==0.23.4
openpyxl
```

Also, have the following variables in your environment.
```
export OPENAI_API_KEY=<your-api-key>
export ANTHROPIC_API_KEY=<your-api-key>
export HF_TOKEN=<your-access-token>
```
or have them in the following files as a JSON with `api_key` as key and the notebook cell will store it in the variable in runtime.-
```
creds/openai_api_key.json
creds/anthropic_api_key.json
creds/hf_access_token.json
```
Then open the `orchestrator.ipynb` notebook. Detailed instruction on how to run the cells and explanation is given in the notebook.

To replicate our results, we have added the `random_subset.json` file used in our experiment. If you want to use different random subsets, run the `random_subset.py` file first. The script will overwrite the `random_subset.json` file. So be careful if you want to keep both files. You can switch between any subsets by changing the `dataset` variable in the notebook.

Note: The first time you run the experiment the context agent may take much longer. This is because it has to extract symbols from their original project. After the first run it will be cached in `symcache.sqlite` so future runs will be faster.
