# CVE Reproducibility Filter & Diversity Sampler

A toolkit for filtering reproducible CVEs and sampling diverse vulnerability datasets from the [CVE List V5](https://github.com/CVEProject/cvelistV5) repository.

## Overview

This project provides a **hybrid filtering approach** combining:

1. **Static Scoring** (`cve_reproducer_filter.py`): Automated scoring based on CVE metadata (POC availability, tech stack, CVSS, etc.)
2. **Diversity Sampling** (`diverse_cve_sampler.py`): Balanced sampling across CWE types, vendors, and severity levels
3. **LLM-as-Judge** (`CVE_SELECTION_GUIDE.md`): Human-readable criteria for LLM-based qualitative assessment of edge cases

## Directory Structure

```
LiveCVEBench-Preview/
└── cve-sampler/
    ├── cve_reproducer_filter.py       # Quick Start Step 2: static scoring
    ├── diverse_cve_sampler.py         # Quick Start Step 3: diversity sampling
    ├── run_monthly_sampling.py        # One-click monthly sampling
    ├── CVE_SELECTION_GUIDE.md         # Quick Start Step 4: LLM-as-Judge criteria
    ├── output/                        # Generated output
    │   ├── summary.json               # Cached metadata
    │   └── CVE-*.md                   # Individual CVE markdown files
    └── cvelistV5/                     # Clone of CVE List V5 (or specify path)
        └── cves/
            ├── 2024/
            ├── 2025/
            └── ...
```

---

## Quick Start

### Step 1: Setup

```bash
# Clone the CVE List V5 repository
git clone https://github.com/CVEProject/cvelistV5.git
```

### Step 2: Generate Summary (Static Scoring)

```bash
# Scan all CVEs and generate summary.json
python cve_reproducer_filter.py --cves-dir ./cvelistV5/cves

# Or scan specific year only
python cve_reproducer_filter.py --cves-dir ./cvelistV5/cves --year 2025

# Or scan only the latest N CVEs (faster for testing)
python cve_reproducer_filter.py --cves-dir ./cvelistV5/cves --latest 5000
```

**Output**: `output/summary.json` containing metadata for all scored CVEs, plus individual `CVE-*.md` files.

> **Note on Performance**: The CVE List V5 repository contains **200,000+ individual JSON files** in deeply nested directories (`cves/2025/1xxx/CVE-2025-1234.json`). The first run scans all files and is **very slow (~30 minutes)**. However, once `summary.json` is generated, all subsequent operations load from this cache and complete in **~2 seconds**.
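
The scan-once, cache-afterwards behavior described in the note can be sketched as follows. This is a minimal illustration, not the code in `cve_reproducer_filter.py`: the function names `iter_cve_files` and `load_or_scan` are hypothetical, and the cache schema (a JSON list of `{"cve_id": ...}` records) is an assumption, since the real `summary.json` layout is not documented here.

```python
import json
from pathlib import Path
from typing import Iterator, List, Optional


def iter_cve_files(cves_dir: str, year: Optional[int] = None) -> Iterator[Path]:
    """Yield CVE JSON files under cves_dir, optionally limited to one year."""
    root = Path(cves_dir)
    if year is not None:
        root = root / str(year)
    if not root.is_dir():
        return
    # Layout from the note above: cves/<year>/<bucket>/CVE-<year>-<id>.json
    yield from sorted(root.rglob("CVE-*.json"))


def load_or_scan(summary_path: str, cves_dir: str) -> List[dict]:
    """Return cached records if the summary exists, else scan and cache."""
    cache = Path(summary_path)
    if cache.is_file():
        # Fast path: one file read instead of walking every CVE file.
        return json.loads(cache.read_text(encoding="utf-8"))
    # Slow path: walk the tree once. The record schema here is an assumption.
    records = [{"cve_id": path.stem} for path in iter_cve_files(cves_dir)]
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(records), encoding="utf-8")
    return records
```

Because the fast path never touches the CVE tree, a cache built this way goes stale after a `git pull` of cvelistV5 and would need to be deleted to force a rescan.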

> **Why does `--min-score` default to 50?** CVEs scoring below 50 typically lack critical reproduction information: no POC, no patch URL, a vague description, or unknown versions. Such CVEs are unsuitable for reliable reproduction. The threshold is applied during sampling (Step 3); the scoring criteria are detailed under Static Scoring Criteria below.

### Step 3: Diversity Sampling

```bash
# Sample from July to November 2025 (100 CVEs per month)
python run_monthly_sampling.py --months 2025-07 2025-11

# Sample specific months
python run_monthly_sampling.py --months 2025-07 2025-08 2025-09

# Customize sampling parameters
python run_monthly_sampling.py --months 2025-07 2025-11 \
    --count 50 \
    --min-score 60 \
    --max-per-cwe 5

# Exclude already-sampled CVEs
python run_monthly_sampling.py --months 2025-07 2025-11 \
    --exclude-dir ./existing_cves
```

**Command-line Options**:

| Option | Default | Description |
|--------|---------|-------------|
| `--months` | (required) | Month range (`2025-07 2025-11`) or explicit list (`2025-07 2025-08 2025-09`) |
| `--count` | 100 | CVEs per month |
| `--min-score` | 50 | Minimum reproducibility score |
| `--top25-per-cwe` | 2 | Guaranteed CVEs per Top 25 CWE |
| `--max-per-cwe` | 10 | Maximum CVEs per CWE type |
| `--max-per-repo` | 10 | Maximum CVEs per vendor/repo |
| `--exclude-dir` | None | Directory with existing CVEs to exclude |
| `--output-dir` | `monthly_samples` | Output directory |
| `--summary` | `output/summary.json` | Path to summary.json |
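
The range form of `--months` can be illustrated with a small helper. This is a sketch under one stated assumption: exactly two values are read as an inclusive range, and any other number of values is taken as an explicit list. `expand_months` is a hypothetical name, and `run_monthly_sampling.py` may resolve the two forms differently.

```python
from typing import List


def _month_index(month: str) -> int:
    """Convert 'YYYY-MM' to a running month count for easy arithmetic."""
    year, mon = month.split("-")
    return int(year) * 12 + int(mon) - 1


def expand_months(months: List[str]) -> List[str]:
    """Expand --months values into an explicit list of 'YYYY-MM' strings.

    Assumption: two values mean an inclusive range; otherwise a plain list.
    """
    if len(months) != 2:
        return list(months)
    start, end = _month_index(months[0]), _month_index(months[1])
    if start > end:
        raise ValueError("range start must not be after range end")
    return [f"{i // 12:04d}-{i % 12 + 1:02d}" for i in range(start, end + 1)]


print(expand_months(["2025-07", "2025-11"]))
```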

**Output Structure**:

```
monthly_samples/
├── 2025-07_top100.json    # Sampled CVEs with metadata
├── 2025-08_top100.json
├── ...
└── all_cve_ids.txt        # All CVE IDs (space-separated)
```
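
Since `all_cve_ids.txt` is space-separated, it can be loaded with a plain whitespace split. The sketch below adds a format check and de-duplication, both of which are illustrative extras rather than documented behavior; `read_cve_ids` is a hypothetical helper name.

```python
import re
from pathlib import Path
from typing import List

# CVE IDs have a four-digit year and a sequence number of four or more digits.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}")


def read_cve_ids(path: str) -> List[str]:
    """Read whitespace-separated CVE IDs, keeping first-seen order."""
    seen = set()
    ids = []
    for token in Path(path).read_text(encoding="utf-8").split():
        if CVE_ID.fullmatch(token) and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids
```

The resulting list can be joined with spaces and pasted into the review prompt in Step 4.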

### Step 4: LLM-as-Judge Review

After automated sampling, use `CVE_SELECTION_GUIDE.md` as a prompt for LLM-based review of edge cases. This step evaluates aspects that automated scoring cannot capture:

- **Uniqueness**: Does this CVE bring new value, or is it a duplicate pattern already in the library?
- **Reproducibility**: Can this vulnerability be triggered on a Linux server, or are there environmental limitations (Windows-only, hardware-dependent, cloud-only SaaS)?

**Usage**: In Claude Code, run:
```
Read @CVE_SELECTION_GUIDE.md and classify the following CVEs.
Their markdown files are in @output/.

CVE-2025-1234 CVE-2025-1235 CVE-2025-1236 ... (paste from all_cve_ids.txt)
```

The LLM will classify each CVE into one of four tiers:

| Tier | Priority | Examples |
|------|----------|----------|
| **REQUIRED** | Highest | Actively exploited, protocol-level, supply chain, security product vulns |
| **RECOMMENDED** | High | Well-known products, unique CWE types, complete security reports |
| **OPTIONAL** | Medium | Gap-filling cases, low CVSS with unique triggers |
| **SKIP** | None | Duplicates, environmental limitations, incomplete info |

See `CVE_SELECTION_GUIDE.md` for the complete assessment criteria.

---

## Static Scoring Criteria

Each CVE is scored based on factors indicating reproducibility in a controlled environment (e.g., Docker).

### Positive Factors

| Factor | Score | Description |
|--------|-------|-------------|
| POC/Exploit URL | +30 | Public proof-of-concept available |
| CISA Confirmed POC | +20 | CISA SSVC confirms POC exists |
| CISA Active Exploitation | +25 | CISA confirms in-the-wild exploitation |
| Commit/Patch URL | +15 | Fix commit available for diff analysis |
| High CVSS (>=7.0) | +10 | High severity indicates real impact |
| Specific Version | +10 | Concrete version number (not "n/a") |
| Attack Details | +5 | Description contains payload/endpoint info |
| GitHub Repository | +5 | Source code accessible |

### Technology Stack Bonus

| Stack | Score | Rationale |
|-------|-------|-----------|
| Python/Django/Flask | +20 | Easy to dockerize |
| PHP/WordPress/Laravel | +18 | Common, well-documented |
| Node.js/Express | +15 | NPM ecosystem |
| Java/Spring/Tomcat | +10 to +12 | Requires JVM setup |
| Go/Rust | +5 to +10 | Compilation needed |
| C/C++ | +2 to +5 | Complex build environment |

### Negative Factors

| Factor | Score | Description |
|--------|-------|-------------|
| Firmware/IoT Vendor | -50 | Requires physical hardware or emulation |
| System-level Product | -30 | OS/kernel vulnerabilities hard to dockerize |
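
The three tables above combine additively. The sketch below shows one way to express that, assuming the facts have already been extracted from the CVE record. The `CveFacts` fields and the `static_score` function are hypothetical names, not the actual interface of `cve_reproducer_filter.py`. Stacks whose bonus is given as a range are omitted because the tables do not state a single value, and no clamping of the total is assumed.

```python
from dataclasses import dataclass

# Fixed stack bonuses from the table above. Stacks listed with a range
# (Java, Go/Rust, C/C++) are left out because no single value is given.
STACK_BONUS = {"python": 20, "php": 18, "nodejs": 15}


@dataclass
class CveFacts:
    """Hypothetical, pre-extracted facts about one CVE record."""
    has_poc_url: bool = False
    cisa_poc: bool = False
    cisa_exploited: bool = False
    has_patch_url: bool = False
    cvss: float = 0.0
    has_specific_version: bool = False
    has_attack_details: bool = False
    has_github_repo: bool = False
    stack: str = ""
    is_firmware_or_iot: bool = False
    is_system_level: bool = False


def static_score(f: CveFacts) -> int:
    """Sum the positive, stack, and negative factors (no clamping assumed)."""
    weighted_flags = [
        (f.has_poc_url, 30),
        (f.cisa_poc, 20),
        (f.cisa_exploited, 25),
        (f.has_patch_url, 15),
        (f.cvss >= 7.0, 10),
        (f.has_specific_version, 10),
        (f.has_attack_details, 5),
        (f.has_github_repo, 5),
        (f.is_firmware_or_iot, -50),
        (f.is_system_level, -30),
    ]
    score = sum(weight for flag, weight in weighted_flags if flag)
    return score + STACK_BONUS.get(f.stack, 0)
```

For example, a Python project with a public POC, a fix commit, a CVSS of 8.1, and a concrete version scores 30 + 15 + 10 + 10 + 20 = 85, which clears the default `--min-score` of 50.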

---

## Diversity Sampling Algorithm

### Design Goals

1. **Reproducibility**: Prioritize CVEs with high scores (POC available, easy setup)
2. **Importance**: Cover critical vulnerability types (CWE Top 25)
3. **Diversity**: Avoid concentration on a few vendors or CWE types

### CWE Category Mapping

Before sampling, similar CWEs are grouped to prevent over-representation of semantically equivalent vulnerabilities:

```
CWE-787, CWE-121, CWE-122  →  'memory_write'   (buffer overflows)
CWE-89, CWE-564            →  'sqli'           (SQL injection variants)
CWE-94, CWE-95, CWE-917    →  'code_injection' (code/eval injection)
```

**Important**: Unmapped CWEs retain their original ID (e.g., `CWE-640` stays as `CWE-640`).
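
A minimal sketch of this grouping, using only the three groups listed above (the full mapping in `diverse_cve_sampler.py` may contain more). The names `CWE_GROUPS` and `cwe_category` are illustrative.

```python
# Only the groups shown above; the tool's real mapping may be larger.
CWE_GROUPS = {
    "CWE-787": "memory_write",
    "CWE-121": "memory_write",
    "CWE-122": "memory_write",
    "CWE-89": "sqli",
    "CWE-564": "sqli",
    "CWE-94": "code_injection",
    "CWE-95": "code_injection",
    "CWE-917": "code_injection",
}


def cwe_category(cwe_id: str) -> str:
    """Return the group name for a CWE, or the CWE ID itself if unmapped."""
    return CWE_GROUPS.get(cwe_id, cwe_id)


print(cwe_category("CWE-122"))  # memory_write
print(cwe_category("CWE-640"))  # CWE-640
```

Falling back to the original ID means per-CWE limits still apply to unmapped CWEs, each counted as its own category.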

### Two-Phase Sampling

#### Phase 1: Top 25 CWE Guarantee

**Goal**: Ensure the sample covers the most dangerous vulnerability types as defined by [MITRE CWE Top 25 (2024)](https://cwe.mitre.org/top25/).

**Algorithm**:
```
For each of the 25 CWE types (ordered by danger score):
    1. Filter: Get all candidate CVEs with this CWE
    2. Sort: Order by reproducibility score (descending)
    3. Select: Take top K CVEs (default K=2)

    Constraints applied during selection:
    - Skip if same repo already selected for this CWE (within-CWE dedup)
    - Skip if repo has reached global limit (cross-CWE dedup)
```

**Result**: Up to 50 CVEs (25 CWE types × 2 each), covering every Top 25 CWE type that has eligible candidates in the month.
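
Phase 1 can be sketched as below. Assumptions: each candidate is a dict with hypothetical keys `id`, `cwe`, `repo`, and `score`, each candidate carries a single CWE, and `top25_cwes` is already ordered by danger score. This illustrates the steps above and is not the code in `diverse_cve_sampler.py`.

```python
from collections import Counter
from typing import Dict, List


def phase1_top25(
    candidates: List[Dict],
    top25_cwes: List[str],
    k: int = 2,
    max_per_repo: int = 10,
) -> List[Dict]:
    """Pick up to k high-scoring CVEs for each Top 25 CWE."""
    selected: List[Dict] = []
    repo_counts: Counter = Counter()
    for cwe in top25_cwes:
        # Filter to this CWE, then sort by reproducibility score (descending).
        pool = sorted(
            (c for c in candidates if c["cwe"] == cwe),
            key=lambda c: c["score"],
            reverse=True,
        )
        repos_for_cwe = set()
        for cve in pool:
            if len(repos_for_cwe) == k:
                break
            repo = cve["repo"]
            # Within-CWE dedup, then the global per-repo limit.
            if repo in repos_for_cwe or repo_counts[repo] >= max_per_repo:
                continue
            selected.append(cve)
            repos_for_cwe.add(repo)
            repo_counts[repo] += 1
    return selected
```

A CWE with fewer than `k` eligible candidates simply contributes fewer CVEs, so the Phase 1 total can fall below `25 * k`.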

#### Phase 2: Composite Scoring

**Goal**: Fill remaining slots (e.g., 50 more for a total of 100) using a multi-objective scoring function.

**Composite Score Formula**:
$$
\text{FinalScore} = \text{BaseScore} + \text{ImportanceBonus} + \text{CVSSBonus} + \text{DiversityBonus} + \text{NoveltyBonus}
$$

| Component | Range | Calculation |
|-----------|-------|-------------|
| BaseScore | 0-120 | Reproducibility score from static scoring (`cve_reproducer_filter.py`) |
| ImportanceBonus | 0-30 | `(CWE_danger_score / 57) * 30` |
| CVSSBonus | 0-20 | `CVSS * 2` |
| DiversityBonus | 0-20 | 20 if the CWE is not yet in the sample, 10 if fewer than 3 CVEs with that CWE are selected, else 0 |
| NoveltyBonus | 0-10 | 10 if new vendor/repo, else 0 |
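
The components in the table translate directly into a small function. In this sketch the parameter names are hypothetical, `cwe_selected` and `repo_selected` are the numbers of already-selected CVEs sharing the candidate's CWE and vendor/repo, and a CWE outside the Top 25 is assumed to have a danger score of 0.

```python
def composite_score(
    base: float,
    cwe_danger: float,
    cvss: float,
    cwe_selected: int,
    repo_selected: int,
) -> float:
    """Combine the five components from the table above."""
    importance = (cwe_danger / 57) * 30  # 57 is just above the top danger score
    cvss_bonus = cvss * 2                # CVSS 0-10 maps to 0-20
    if cwe_selected == 0:
        diversity = 20                   # CWE not yet in the sample
    elif cwe_selected < 3:
        diversity = 10                   # CWE still lightly represented
    else:
        diversity = 0
    novelty = 10 if repo_selected == 0 else 0
    return base + importance + cvss_bonus + diversity + novelty
```

For instance, a CWE-79 candidate (danger score 56.92) with a base score of 70, CVSS 7.5, one CWE-79 CVE already selected, and a vendor already in the sample gets 70 + 29.96 + 15 + 10 + 0, or about 124.96.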

**Algorithm**:
```
1. Calculate FinalScore for all remaining candidates
2. Sort by FinalScore (descending)
3. Greedy selection with constraints:
   - Skip if CWE count >= max_per_cwe
   - Skip if repo count >= max_per_repo
4. Continue until target count reached
```
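
The greedy loop above can be sketched as follows, assuming each candidate dict already carries a precomputed `final_score` alongside the hypothetical keys `id`, `cwe`, and `repo`, and that `selected` holds the Phase 1 picks. Scores are computed once before sorting, as in step 1; whether the actual tool recomputes bonuses as the sample grows is not stated here.

```python
from collections import Counter
from typing import Dict, List


def phase2_greedy(
    candidates: List[Dict],
    selected: List[Dict],
    target: int,
    max_per_cwe: int = 10,
    max_per_repo: int = 10,
) -> List[Dict]:
    """Extend the Phase 1 selection up to target, honoring both caps."""
    result = list(selected)
    chosen_ids = {c["id"] for c in selected}
    cwe_counts = Counter(c["cwe"] for c in selected)
    repo_counts = Counter(c["repo"] for c in selected)
    ranked = sorted(candidates, key=lambda c: c["final_score"], reverse=True)
    for cve in ranked:
        if len(result) >= target:
            break
        if cve["id"] in chosen_ids:
            continue  # already picked in Phase 1
        if cwe_counts[cve["cwe"]] >= max_per_cwe:
            continue
        if repo_counts[cve["repo"]] >= max_per_repo:
            continue
        result.append(cve)
        chosen_ids.add(cve["id"])
        cwe_counts[cve["cwe"]] += 1
        repo_counts[cve["repo"]] += 1
    return result
```

If the caps exclude too many candidates, the loop ends with fewer than `target` CVEs rather than relaxing a constraint.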

---

## Example Results

Sampling 100 CVEs per month from July-November 2025:

| Month | CVEs | Vendors | CWE Types | Score Range | Top 25 Coverage |
|-------|------|---------|-----------|-------------|-----------------|
| 2025-07 | 100 | 64 | 41 | 50-100 | 25/25 |
| 2025-08 | 100 | 56 | 46 | 57-100 | 24/25 |
| 2025-09 | 100 | 62 | 42 | 57-97 | 25/25 |
| 2025-10 | 100 | 50 | 39 | 52-95 | 25/25 |
| 2025-11 | 100 | 52 | 45 | 50-95 | 24/25 |

---

## CWE Top 25 (2024)

| Rank | CWE | Name | Score |
|------|-----|------|-------|
| 1 | CWE-79 | Cross-site Scripting (XSS) | 56.92 |
| 2 | CWE-787 | Out-of-bounds Write | 45.20 |
| 3 | CWE-89 | SQL Injection | 35.88 |
| 4 | CWE-352 | Cross-Site Request Forgery | 19.57 |
| 5 | CWE-22 | Path Traversal | 12.74 |
| 6 | CWE-125 | Out-of-bounds Read | 11.42 |
| 7 | CWE-78 | OS Command Injection | 11.30 |
| 8 | CWE-416 | Use After Free | 10.19 |
| 9 | CWE-862 | Missing Authorization | 10.11 |
| 10 | CWE-434 | Unrestricted File Upload | 10.03 |
| 11 | CWE-94 | Code Injection | 7.13 |
| 12 | CWE-20 | Improper Input Validation | 6.78 |
| 13 | CWE-77 | Command Injection | 6.74 |
| 14 | CWE-287 | Improper Authentication | 5.94 |
| 15 | CWE-269 | Improper Privilege Management | 5.22 |
| 16 | CWE-502 | Deserialization of Untrusted Data | 5.07 |
| 17 | CWE-200 | Information Exposure | 5.07 |
| 18 | CWE-863 | Incorrect Authorization | 4.05 |
| 19 | CWE-918 | Server-Side Request Forgery | 4.05 |
| 20 | CWE-119 | Buffer Overflow | 3.69 |
| 21 | CWE-476 | NULL Pointer Dereference | 3.58 |
| 22 | CWE-798 | Hard-coded Credentials | 3.46 |
| 23 | CWE-190 | Integer Overflow | 3.37 |
| 24 | CWE-400 | Uncontrolled Resource Consumption | 3.23 |
| 25 | CWE-306 | Missing Authentication | 2.73 |

Source: [MITRE CWE Top 25 (2024)](https://cwe.mitre.org/top25/archive/2024/2024_cwe_top25.html)

---

## Acknowledgments

- [CVE Program](https://www.cve.org/) for the CVE database
- [MITRE CWE](https://cwe.mitre.org/) for vulnerability classification
- [CISA](https://www.cisa.gov/) for SSVC assessments
