# 🏆 AgentSWE-bench: Supplementary Material

## 🎯 SWE-eval Framework Overview

This repository contains the **SWE-eval** framework, a multi-dimensional evaluation system for agent-driven GitHub issue resolution. The framework provides trajectory-enhanced evaluation, combining rule-based and LLM-based approaches to assess software engineering agents.

> **📝 Framework Highlights**: SWE-eval replaces the scoring logic of multiple small models with high-performance LLMs, which makes it possible to evaluate multi-turn conversation trajectories with metrics including Intra-turns consistency, Inter-turns coherence, and Info-gain.


---

# What is in `code/`?

> 🎯 **Complete Implementation & Reproducible Research Materials**

## 🏗️ Code Directory Structure

| 📂 **Component** | 🎯 **Purpose** | 🛠️ **Key Features** |
|---|---|---|
| [🤖 LLM Evaluation](#trajectory-evaluation-by-llm) | AI-based trajectory scoring | Multi-threaded, statistical analysis |
| [📏 Rule-based Evaluation](#rule-based-evaluation) | Traditional assessment methods | Report generation, preprocessing |
| [🧪 SWE-bench Integration](#swe-bench-integration) | Official benchmark integration | Evaluation harness, dataset tools |

### 🤖 **Trajectory Evaluation by LLM** (`Evaluate_Trajectory_By_LLM/`)
- 🧠 **LLM-based scoring**: Advanced evaluation using language models with Intra-turns, Inter-turns, and Info-gain metrics
- ⚡ **Multi-threaded processing**: Efficient batch evaluation capabilities with dedicated multi-processing scripts
- 📊 **Statistical analysis**: Comprehensive metrics aggregation and reporting tools
- 🔧 **Modular clients**: Support for DeepSeek-V3 and OpenAI-compatible LLM providers
- 📝 **Trajectory summarization**: Automated conversation summary generation
- 🎯 **ReCEval integration**: Modified ReCEval framework for trajectory assessment
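
As a rough illustration of the statistical-analysis step, the sketch below averages the three trajectory metrics per agent. The record layout (`agent`, `intra_turns`, `inter_turns`, `info_gain`) and the function name `aggregate_scores` are assumptions made for this example; they are not taken from `result_statistic.py`, whose actual schema may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric keys; the real output schema may differ.
METRICS = ("intra_turns", "inter_turns", "info_gain")


def aggregate_scores(records):
    """Return {agent: {metric: mean score}} for a list of score records."""
    by_agent = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for metric in METRICS:
            if metric in rec:  # tolerate records that lack a metric
                by_agent[rec["agent"]][metric].append(rec[metric])
    return {
        agent: {m: round(mean(vals), 3) for m, vals in scores.items()}
        for agent, scores in by_agent.items()
    }


if __name__ == "__main__":
    sample = [
        {"agent": "SWE-agent", "intra_turns": 0.8, "inter_turns": 0.6, "info_gain": 0.5},
        {"agent": "SWE-agent", "intra_turns": 0.6, "inter_turns": 0.8, "info_gain": 0.7},
        {"agent": "OpenHands", "intra_turns": 0.9, "inter_turns": 0.7, "info_gain": 0.4},
    ]
    print(aggregate_scores(sample))
```

Averaging per metric rather than per record means a trajectory with a missing score still contributes to the metrics it does have.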

### 📏 **Rule-based Evaluation** (`Evaluate_Trajectory_By_Rule/`)
- 📐 **Pattern detection**: Loop detection, turn analysis, and statistical metrics without LLM calls
- 📋 **Multi-agent support**: Specialized processors for SWE-agent, OpenHands, and Moatless trajectories
- 🔄 **Data preprocessing**: Trajectory merging utilities and patch correctness categorization
- 📊 **Classification integration**: Cross-references with patch evaluation results (resolved/unresolved/empty patch)
- 🎪 **Advanced features**: Stuck behavior identification, conversation flow analysis, and error pattern detection
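
Loop detection of the kind listed above needs no LLM call: one simple approach is to look for runs of identical consecutive actions. This is a minimal sketch under assumptions. A trajectory is taken to be a plain list of action strings, and `find_loops` and `min_repeats` are names invented here, not the repository's API.

```python
def find_loops(actions, min_repeats=3):
    """Return (start_index, action, run_length) for each run of identical
    consecutive actions that is at least `min_repeats` long."""
    loops = []
    start = 0
    for i in range(1, len(actions) + 1):
        # A run ends at the end of the list or when the action changes.
        if i == len(actions) or actions[i] != actions[start]:
            run_length = i - start
            if run_length >= min_repeats:
                loops.append((start, actions[start], run_length))
            start = i
    return loops


if __name__ == "__main__":
    trajectory = ["ls", "edit foo.py", "edit foo.py", "edit foo.py", "pytest"]
    print(find_loops(trajectory))  # [(1, 'edit foo.py', 3)]
```

Exact-match comparison only catches verbatim repetition; an agent that cycles through near-identical commands would need a normalization step before comparison.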

### 🧪 **SWE-bench Integration** (`swebench/`)
- 🔗 **Official framework**: Direct integration with SWE-bench ecosystem
- 🏃 **Evaluation harness**: Complete testing and validation pipeline
- 📚 **Dataset tools**: Collection, processing, and management utilities
- 🐳 **Docker support**: Containerized evaluation environments
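
The evaluation harness consumes a predictions file. In the upstream SWE-bench harness each prediction is a JSON object with the keys `instance_id`, `model_name_or_path`, and `model_patch`; the sketch below assumes the bundled copy follows the same convention. The helper name `write_predictions` and the sample instance ID are illustrative only.

```python
import json


def write_predictions(path, patches, model_name):
    """Write one prediction per line using the upstream SWE-bench key layout.

    `patches` maps instance_id -> unified diff text (may be an empty string).
    Returns the number of records written.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            fh.write(json.dumps(record) + "\n")
    return len(patches)


if __name__ == "__main__":
    # Hypothetical instance ID and patch text, for illustration only.
    demo = {"example__repo-123": "diff --git a/x.py b/x.py\n"}
    print(write_predictions("predictions.jsonl", demo, "my-agent"))
```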

---

## Data Components

| 🗂️ **Data Type** | 📝 **Description** | 🎯 **Usage** |
|---|---|---|
| 🎯 **Trajectory Data** | Original agent conversation logs (JSONL format) | Input for all evaluation methods |
| 📈 **Evaluation Results** | ReCEval metrics (Intra-turns, Inter-turns, Info-gain) | LLM-based trajectory assessment |
| 🔧 **Patch Analysis** | Code patch correctness evaluation results | Classification and correlation analysis |
| 📝 **Summary Reports** | Rule-based pattern analysis and statistical insights | Traditional evaluation output |
| 🧠 **LLM Summaries** | AI-generated trajectory summaries | Intermediate processing for LLM evaluation |
| 📊 **Statistical Data** | Aggregated performance metrics by agent/model type | Final analysis and reporting |
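
To show how the trajectory data and patch analysis results might be combined, the following sketch reads a JSONL file and buckets trajectories by patch outcome. It assumes each trajectory record carries an `instance_id` field and that patch results are available as a mapping from that ID to a category string; both are assumptions for illustration, and `load_jsonl` and `group_by_patch_status` are hypothetical helper names.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def group_by_patch_status(trajectories, patch_status):
    """Bucket trajectories by patch outcome.

    `patch_status` maps instance_id -> "resolved" | "unresolved" | "empty patch".
    Trajectories without an entry are placed under "unknown".
    """
    groups = defaultdict(list)
    for traj in trajectories:
        status = patch_status.get(traj["instance_id"], "unknown")
        groups[status].append(traj)
    return dict(groups)
```

A typical call would be `group_by_patch_status(load_jsonl("trajectories.jsonl"), status_map)`, where the file name and `status_map` stand in for the actual inputs.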

---

# 🚀 Quick Start Guide

## Setup Process

```bash
# 1️⃣ Install dependencies
cd code/
pip install -r requirements.txt

# 2️⃣ Configure LLM API credentials (Required for LLM-based evaluation)
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="your-api-base-url"
# Or modify Evaluate_Trajectory_By_LLM/llm_clients/BaseLLMClient.py directly

# 3️⃣ Run evaluation pipeline
# Option A: Rule-based evaluation (no API required)
python Evaluate_Trajectory_By_Rule/trajectory2report-SWE-agent.py

# Option B: LLM-based evaluation (requires API)
python Evaluate_Trajectory_By_LLM/trajectory_summary.py
python Evaluate_Trajectory_By_LLM/receval_modification.py
python Evaluate_Trajectory_By_LLM/result_statistic.py

# Option C: SWE-bench patch evaluation
python swebench/harness/run_evaluation.py --dataset_name <dataset> --predictions_path <predictions_file>
```

## 📋 Prerequisites Checklist

- ✅ **Python 3.8+**: Required runtime environment
- ✅ **Dependencies**: Install from `requirements.txt` (requests~=2.32.4, openai~=1.93.0)
- ✅ **API Keys**: LLM provider credentials (required only for LLM-based evaluation)
  - OpenAI-compatible API key and base URL
  - DeepSeek-V3 supported by default
- ✅ **Storage**: ~10GB for full dataset processing
- ✅ **Input Data**: Agent trajectory files in JSON/JSONL format
- ✅ **Patch Evaluation Results**: Classification data for trajectory categorization (optional)
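
For readers wiring up their own provider, the credentials can be picked up from the environment along these lines. This is a sketch, not the contents of `BaseLLMClient.py`: the function names are hypothetical, and treating `OPENAI_BASE_URL` as optional is an assumption. The `OpenAI(api_key=..., base_url=...)` constructor is the one provided by the `openai` Python package (v1+).

```python
import os


def load_llm_config(env=None):
    """Collect OpenAI-compatible credentials from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    config = {"api_key": api_key}
    base_url = env.get("OPENAI_BASE_URL")
    if base_url:  # assumed optional: omit to use the SDK's default endpoint
        config["base_url"] = base_url
    return config


def make_client(env=None):
    """Build a client with the `openai` package (imported lazily)."""
    from openai import OpenAI  # requires the `openai` dependency from requirements.txt

    return OpenAI(**load_llm_config(env))
```

Failing fast on a missing key surfaces configuration mistakes before a long batch evaluation starts.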

---

# 📚 Documentation Hub

| 📖 **Resource** | 🎯 **Content** | 👥 **Audience** |
|---|---|---|
| `code/README.md` | Complete SWE-eval framework guide with usage examples | All users |
| `appendix.pdf` | Theoretical foundations and extended experimental results | Researchers |
| Script documentation | Function-specific implementation details | Developers |
| Inline comments | Code-level implementation explanations | Code reviewers |
| API client examples | LLM integration and customization guides | Technical integrators |

---

# 🔄 Reproducibility Guarantee

## ✅ **What We Provide:**

- 🧑‍💻 **Complete Source Code**: Fully documented implementation
- 🧪 **Sample Data**: Test datasets for validation
- 📋 **Step-by-step Instructions**: Detailed execution guides
- 🌍 **Environment Specs**: Exact configuration requirements
- 🔍 **Verification Tools**: Result validation utilities

## 🎯 **Reproducibility Levels:**

| 🏷️ **Level** | 🎯 **Scope** | ⏱️ **Time** | 💾 **Resources** |
|---|---|---|---|
| 🟢 **Quick Demo** | Rule-based evaluation sample | ~15 minutes | 2GB RAM |
| 🟡 **Partial Replication** | Single-agent LLM evaluation | ~1-2 hours | 8GB RAM |
| 🔴 **Full Reproduction** | Complete multi-agent analysis | ~24-48 hours | 32GB RAM + API costs |

---

# 📞 Support & Review Guidelines

## 🆘 **Getting Help:**

- 📖 **Theory Questions**: Refer to technical appendix (`appendix.pdf`)
- 💻 **Implementation Issues**: Check code documentation and comments
- 🔍 **Data Questions**: Review dataset structure descriptions
- 🐛 **Bug Reports**: Use detailed error logs and context

## 🎯 **For Reviewers:**

### 📋 **Recommended Review Process:**

1. **📄 Read the Appendix** (15-30 min)
   - 🎯 Understand theoretical foundations
   - 📊 Review extended experimental results

2. **💻 Examine the Code** (30-60 min)
   - 🔍 Assess implementation quality
   - ✅ Check completeness and documentation

3. **🧪 Test Reproducibility** (60-120 min)
   - 🚀 Run quick demo with sample data
   - 📊 Verify outputs match expected results

4. **📊 Cross-verify Claims** (30-45 min)
   - 🔍 Compare results with paper assertions
   - 📈 Validate statistical conclusions

### 🏆 **Quality Indicators:**

- ✅ **Code Quality**: Clear, well-documented, modular
- ✅ **Data Integrity**: Consistent, validated, complete
- ✅ **Reproducibility**: Successful execution in a clean environment
- ✅ **Documentation**: Comprehensive, accurate, up-to-date

---

## Final Notes

### 🎭 **Anonymity Notice**
All materials have been carefully anonymized per AAAI 2026 conference guidelines while maintaining full functionality and reproducibility.

### 🙏 **Acknowledgments**
Thank you for taking the time to review our supplementary materials! Your thorough evaluation helps advance the field of software engineering automation and agent-based systems.

### **Contact**
For any questions during the review process, please refer to the documentation hierarchy:
1. 📖 Technical appendix for theoretical details
2. 💻 Code documentation for implementation specifics  
3. 📊 Data descriptions for dataset understanding

---

**🌟 Happy Reviewing! 🌟**