# SWE-eval 🎯

This repository contains the code implementation of the multi-dimensional evaluation framework proposed in the paper [SWE-eval: Trajectory-Enhanced Multi-Dimensional Evaluation for Agent-Driven GitHub Issue Resolution]().

## Overview ✨

This framework evaluates the multi-turn conversation trajectories of agent systems. It replaces the scoring logic of the original work, which relied on multiple small models, with high-performance LLMs.

> **📝 Note on Dataset**: Because the complete dataset is large, this repository currently contains only demo data. The full dataset will be released publicly in the future.

## Repository Structure 📁

```
|--data
|--|--trajectory_original         # Original trajectory data files (pre-categorized)
|--|--trajectory-evaluation_by_llm # Statistical ReCEval metric values
|--|--patch-evaluation            # Legacy evaluation data (used for table generation)
|--|--traj-evaluation            # Trajectory evaluation results

|--|--temp                       # Intermediate data (generated and used during runtime)
|--|--|--summary_data            # Summaries of each trajectory
|--|--|--receval_result          # Metric values calculated through ReCEval

|--Evaluate_Trajectory_By_LLM
|--|--trajectory_summary.py      # Summarizes trajectories.jsonl into series of .json files
|--|--receval_modification.py    # Calls LLM to score and evaluate summary.json
|--|--result_statistic.py        # Aggregates scoring results from previous script

|--|--llm_clients               # LLM API access definitions
|--|--|--BaseLLMClient.py       # Stores API endpoint and secret key information
|--|--|--DSV3Client.py          # DeepSeek-V3 client

|--|--utils                     # Utility functions for evaluation
|--|--|--split_instance_id.py   # Splits model name and pure ID from instance_id
|--|--|--safe_statistic.py      # Robust statistical script for input data

|--|--multi_main
|--|--|--multi_summary.py       # Multi-threaded summary processing
|--|--|--multi_receval.py       # Multi-threaded ReCEval processing

|--Evaluate_Trajectory_By_Rule   # Rule-based trajectory evaluation
|--|--trajectory2report-*.py    # Generate reports for different agent systems
|--|--preprocessing_merge_traj_json_into_a_jsonl/ # Preprocessing utilities
|--|--split_traj_by_patch_correctness/ # Split trajectories by patch correctness

|--swebench                     # SWE-bench related utilities
```

## Prerequisites 📋

- Python 3.8+
- Required packages (install via `pip install -r requirements.txt`):
  - requests~=2.32.4
  - openai~=1.93.0

## Setup 🚀

1. **Clone the repository:**
   ```bash
   git clone <repository-url>
   cd SWE-eval
   ```

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Configure LLM API (Required for Part 3):**
   - Set environment variables:
     ```bash
     export OPENAI_API_KEY="your-api-key"
     export OPENAI_BASE_URL="your-api-base-url"
     ```
   - Or modify `Evaluate_Trajectory_By_LLM/llm_clients/BaseLLMClient.py` directly to include your API credentials.
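Before starting a long evaluation run, it can help to confirm that both variables are actually set. The following is a minimal sketch; `load_llm_credentials` is a hypothetical helper and is not part of this repository, and how `BaseLLMClient.py` reads its credentials may differ.

```python
import os
from typing import Dict, Mapping, Optional


def load_llm_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the API key and base URL, failing early if either is unset.

    Hypothetical helper: the variable names come from the Setup section,
    everything else is an illustrative assumption.
    """
    env = os.environ if env is None else env
    names = ("OPENAI_API_KEY", "OPENAI_BASE_URL")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"api_key": env["OPENAI_API_KEY"], "base_url": env["OPENAI_BASE_URL"]}
```

Calling `load_llm_credentials()` with no arguments reads from the real environment; passing a dictionary makes the check easy to test.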

## Usage 💫

> **📝 Note**: All Python commands in this guide can be run directly from the project root directory without needing to `cd` into subdirectories.

### Part 1: SWE-bench Patch Evaluation 🔧

This part uses the standard SWE-bench evaluation pipeline to assess patch correctness.

**To run patch evaluation:**
```bash
python swebench/harness/run_evaluation.py --dataset_name <dataset> --predictions_path <predictions_file>
```
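The format of `<predictions_file>` is not described here. In the upstream SWE-bench harness, each prediction carries the fields `instance_id`, `model_name_or_path`, and `model_patch`; the sketch below assumes the bundled `swebench` copy expects the same fields. `write_predictions` is a hypothetical helper, not a function from this repository.

```python
import json
from pathlib import Path
from typing import Dict, Iterable

# Field names follow the upstream SWE-bench convention (assumed to apply here).
REQUIRED_FIELDS = {"instance_id", "model_name_or_path", "model_patch"}


def write_predictions(predictions: Iterable[Dict[str, str]], path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for record in predictions:
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"prediction is missing fields: {sorted(missing)}")
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count
```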

### Part 2: Rule-Based Trajectory Evaluation 📊

This module evaluates agent trajectories using predefined rules and statistical patterns without requiring LLM calls. It analyzes conversation flows, detects problematic patterns, and generates comprehensive reports on agent behavior.

#### Features 🎪

The rule-based evaluation system provides:

- **Loop Detection** 🔄: Identifies when agents get stuck repeating the same actions or responses
- **Turn Analysis** 💬: Analyzes conversation turn patterns and interaction flows
- **Statistical Metrics** 📈: Calculates quantitative measures of trajectory quality
- **Classification Integration** 🏷️: Cross-references with patch evaluation results (resolved/unresolved/empty patch)
- **Agent-Specific Processing** 🤖: Handles different trajectory formats from various agent systems
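To illustrate the idea behind loop detection, here is a minimal sketch that flags a trajectory when the same action appears several times in a row. The function names and the list-of-strings input are assumptions made for illustration; the actual scripts may use different rules and data structures.

```python
from typing import List, Optional


def longest_repeat_run(actions: List[str]) -> int:
    """Return the length of the longest run of identical consecutive actions."""
    longest = 0
    current = 0
    previous: Optional[str] = None
    for action in actions:
        # Extend the current run on a repeat, otherwise start a new run.
        current = current + 1 if action == previous else 1
        longest = max(longest, current)
        previous = action
    return longest


def has_loop(actions: List[str], threshold: int = 3) -> bool:
    """Treat `threshold` or more identical consecutive actions as a loop."""
    return longest_repeat_run(actions) >= threshold
```

This only catches exact, back-to-back repetition. Alternating patterns such as `a, b, a, b` would need a different rule.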

#### Supported Agent Systems 🤖

1. **Moatless** 💻: Conversation-based code assistance agents
2. **OpenHands** 🙌: Multi-modal software development agents  
3. **SWE-agent** 🔧: Software engineering specialized agents

#### Input Data Requirements 📝

Before running rule-based evaluation, ensure you have:

- **Trajectory files** 📄: Agent conversation logs in JSON/JSONL format
- **Patch evaluation results** 🏁: Classification of issues as resolved/unresolved/empty patch
- **Proper file structure** 🗂️: Organized according to agent type

#### Step-by-Step Usage 🎯

**Step 1: Prepare Input Data** 📋

If you have separate trajectory JSON files that need to be merged:

```bash
# For OpenHands trajectories
python Evaluate_Trajectory_By_Rule/preprocessing_merge_traj_json_into_a_jsonl/merge-traj-openhands.py --input_dir <trajectory_folder> --output_dir <output_folder>

# For SWE-agent trajectories  
python Evaluate_Trajectory_By_Rule/preprocessing_merge_traj_json_into_a_jsonl/merge-traj-swe-agent.py --input_dir <trajectory_folder> --output_dir <output_folder>
```
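Conceptually, the merge step reads each per-instance JSON file and writes it as one line of a JSONL file. The sketch below shows that idea under simple assumptions (one JSON document per `*.json` file, files processed in name order); `merge_json_to_jsonl` is a hypothetical function, not the implementation of the scripts above.

```python
import json
from pathlib import Path


def merge_json_to_jsonl(input_dir: Path, output_file: Path) -> int:
    """Write every *.json file in input_dir as one line of output_file.

    Returns the number of merged records.
    """
    count = 0
    with output_file.open("w", encoding="utf-8") as out:
        # Sorting keeps the output order stable across runs.
        for path in sorted(input_dir.glob("*.json")):
            with path.open(encoding="utf-8") as fh:
                record = json.load(fh)
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```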

**Step 2: Split Trajectories by Patch Correctness (Optional)** 📂

Organize trajectories based on patch evaluation results:

```bash
# Split trajectories into resolved/unresolved/empty_patch categories
python Evaluate_Trajectory_By_Rule/split_traj_by_patch_correctness/split.py

# Generate parameters for trajectory reporting
python Evaluate_Trajectory_By_Rule/split_traj_by_patch_correctness/generate-parameter-of-trajectory2report.py
```

This creates separate files for:
- `*_resolved_ids.jsonl` ✅: Successfully resolved issues
- `*_unresolved_ids.jsonl` ❌: Failed to resolve issues  
- `*_empty_patch_ids.jsonl` ⚪: No meaningful patch generated
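The split can be pictured as a lookup of each trajectory's ID against the patch evaluation results. The following sketch assumes each trajectory is a dictionary with an `instance_id` key and that the resolved and empty-patch IDs are available as sets; both are illustrative assumptions, and `split_by_outcome` is not a function from `split.py`.

```python
from typing import Dict, Iterable, List, Set


def split_by_outcome(
    trajectories: Iterable[Dict],
    resolved: Set[str],
    empty_patch: Set[str],
) -> Dict[str, List[Dict]]:
    """Group trajectories into resolved / unresolved / empty_patch buckets."""
    buckets: Dict[str, List[Dict]] = {"resolved": [], "unresolved": [], "empty_patch": []}
    for traj in trajectories:
        instance_id = traj["instance_id"]
        if instance_id in empty_patch:
            key = "empty_patch"
        elif instance_id in resolved:
            key = "resolved"
        else:
            # Simplification: anything not listed elsewhere counts as unresolved.
            key = "unresolved"
        buckets[key].append(traj)
    return buckets
```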

**Step 3: Run Rule-Based Analysis** 🔍

Execute the appropriate script for your agent system:

1. **For Moatless agents:**
   ```bash
   python Evaluate_Trajectory_By_Rule/trajectory2report-moatless.py
   ```

2. **For OpenHands agents:**
   ```bash
   python Evaluate_Trajectory_By_Rule/trajectory2report-openhands.py
   ```

3. **For SWE-agent:**
   ```bash
   python Evaluate_Trajectory_By_Rule/trajectory2report-SWE-agent.py
   ```

#### Configuration Options

Each trajectory processing script is configured by editing the script directly:

- **Input file paths**: Modify `jsonl_file_list` in the `main()` function
- **Output directory**: Change `output_path` variable
- **Evaluation criteria**: Adjust detection thresholds and rules
- **Classification files**: Update paths to patch evaluation results

#### Output Reports

The rule-based evaluation generates detailed reports including:

**Statistical Summary:**
- Total trajectories processed
- Success/failure rates by category
- Average conversation length
- Turn distribution analysis

**Pattern Detection:**
- Loop detection results
- Stuck behavior identification  
- Conversation flow anomalies
- Error pattern frequency

**Classification Breakdown:**
- Performance by patch correctness category
- Agent behavior differences across problem types
- Correlation between trajectory patterns and outcomes

**Example Output Structure:**
```
data/
├── traj-evaluation/
│   ├── resolved/
│   │   ├── moatless_resolved_report.json
│   │   └── statistical_summary.json
│   ├── unresolved/
│   │   ├── openhands_unresolved_report.json
│   │   └── failure_analysis.json
│   └── empty_patch/
│       ├── swe-agent_empty_report.json
│       └── pattern_analysis.json
```

#### Customizing Rule-Based Evaluation

**Adding New Detection Rules:**

1. Create detection functions in the trajectory processing script:
```python
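from typing import Dict, List

def detect_custom_pattern(messages: List[Dict]) -> bool:
    # Placeholder rule (assumes a "content" key): flag any empty message
    pattern_detected = any(not m.get("content") for m in messages)
    return pattern_detected

def analyze_custom_metric(trajectory: Dict) -> float:
    # Placeholder metric: number of messages in the trajectory
    metric_value = float(len(trajectory.get("messages", [])))
    return metric_value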
```

2. Integrate into the main processing loop:
```python
# Add to trajectory analysis
custom_result = detect_custom_pattern(trajectory['messages'])
custom_metric = analyze_custom_metric(trajectory)
```

**Modifying Thresholds:**

Edit detection parameters in the script:
```python
# Loop detection threshold
LOOP_THRESHOLD = 3  # Number of repeated messages to consider as loop

# Turn count limits
MAX_REASONABLE_TURNS = 50
MIN_MEANINGFUL_TURNS = 5

# Content similarity threshold for pattern detection
SIMILARITY_THRESHOLD = 0.8
```
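As an illustration of how a similarity threshold might be applied, the sketch below counts adjacent messages that are near-duplicates of each other. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is an assumption: the scripts in this repository may measure similarity differently. `count_near_duplicate_turns` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import List

SIMILARITY_THRESHOLD = 0.8


def count_near_duplicate_turns(
    messages: List[str], threshold: float = SIMILARITY_THRESHOLD
) -> int:
    """Count adjacent message pairs whose similarity ratio meets the threshold."""
    return sum(
        1
        for previous, current in zip(messages, messages[1:])
        if SequenceMatcher(None, previous, current).ratio() >= threshold
    )
```

Raising the threshold toward `1.0` restricts matches to almost identical messages, while lowering it also flags loosely similar turns.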

### Part 3: LLM-Based Trajectory Evaluation 🎆

This is the main evaluation pipeline using LLMs for trajectory assessment.

**Step 1: Configure LLM Client** ⚙️
1. Ensure your API credentials are set (see Setup section)
2. By default, the system uses DeepSeek-V3. To use a different model:
   - Create a new class in `Evaluate_Trajectory_By_LLM/llm_clients/` that inherits from `BaseLLMClient`
   - Implement the `_chat_with_messages(self, messages)` method
   - Update the import statements in the main scripts

**Step 2: Generate Trajectory Summaries** 📑
```bash
python Evaluate_Trajectory_By_LLM/trajectory_summary.py
```
This script:
- Reads original trajectory files from `data/trajectory_original/`
- Generates conversation summaries
- Outputs summaries to `data/temp/summary_data/`

**Step 3: Evaluate with LLM** 🧠
```bash
python Evaluate_Trajectory_By_LLM/receval_modification.py
```
This script:
- Uses LLM to score each trajectory
- Calculates `Intra-turns`, `Inter-turns`, and `Info-gain` metrics
- Outputs results to `data/temp/receval_result/`

**Step 4: Generate Statistics** 📊
```bash
python Evaluate_Trajectory_By_LLM/result_statistic.py
```
This script:
- Aggregates results by model/agent type
- Categorizes results by `Empty Patch`/`Unresolved`/`Resolved`
- Generates overall statistical information
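The aggregation step can be thought of as grouping scores by model and outcome category and then averaging each group. The sketch below assumes records with `model`, `category`, and `score` keys; these field names are hypothetical and the actual result files may be structured differently.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def aggregate_scores(records: Iterable[Dict]) -> Dict[Tuple[str, str], float]:
    """Return the mean score for each (model, category) pair."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for record in records:
        grouped[(record["model"], record["category"])].append(record["score"])
    return {key: mean(values) for key, values in grouped.items()}
```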

**Multi-threaded Processing (Optional):** ⚡
For faster processing of large datasets:
```bash
python Evaluate_Trajectory_By_LLM/multi_main/multi_summary.py    # Multi-threaded summary generation
python Evaluate_Trajectory_By_LLM/multi_main/multi_receval.py    # Multi-threaded ReCEval processing
```
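Threads help here because most of the time is spent waiting on LLM API responses rather than computing. A minimal sketch of the pattern with the standard library follows; `run_in_threads` is a hypothetical helper and does not reflect how `multi_summary.py` or `multi_receval.py` are written.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_threads(func: Callable[[T], R], items: Sequence[T], max_workers: int = 4) -> List[R]:
    """Apply func to every item using a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(func, items))
```

A reasonable `max_workers` value depends on the rate limits of the LLM API in use.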

## Output Files 📈

- **Trajectory Summaries** 📝: `data/temp/summary_data/*.json`
- **ReCEval Results** 🎯: `data/temp/receval_result/*.json`
- **Statistical Reports** 📊: Generated by `result_statistic.py`
- **Final Evaluation** 🏆: `data/trajectory-evaluation_by_llm/`

## Customization 🛠️

### Adding New LLM Clients 🤖

1. Create a new client class in `Evaluate_Trajectory_By_LLM/llm_clients/`:
```python
from .BaseLLMClient import BaseLLMClient

class YourLLMClient(BaseLLMClient):
    def __init__(self):
        super().__init__()
        self.model = "your-model-name"
    
    def _chat_with_messages(self, messages):
        # Implement your LLM API call logic
        pass
```

2. Update imports in the main scripts to use your new client.

### Modifying Evaluation Criteria 🎛️

Edit the prompts in `Evaluate_Trajectory_By_LLM/trajectory_summary.py` and `Evaluate_Trajectory_By_LLM/receval_modification.py` to adjust:
- Summary generation criteria
- Scoring rubrics
- Evaluation dimensions

## Troubleshooting 🔧

1. **API Key Issues** 🔑: Ensure your API credentials are correctly set and have sufficient quota
2. **File Path Issues** 📁: Verify that input trajectory files exist in the expected directories
3. **Memory Issues** 💾: Use multi-threaded versions for large datasets
4. **Model Compatibility** 🤝: Ensure your LLM client is compatible with the OpenAI API format

## Citation 📚

If you use this framework in your research, please cite our paper. The patch evaluation in Part 1 builds on SWE-bench, which can be cited as follows:

```bibtex
@inproceedings{jimenez2024swebench,
    title={SWE-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}
```

## Contributing 🎉

Please refer to the paper for detailed methodology and evaluation criteria. For technical issues, check the existing trajectory evaluation results and logs in the `data/temp/` directory. 🎆✨
