# BargainBench: Evaluating Bargaining Skills in Online Second-Hand Marketplaces with LLM Seller Agents


> **BargainBench** is a comprehensive framework for evaluating large language models' bargaining abilities in multi-turn e-commerce scenarios. Our approach shifts from outcome-based evaluation to process-based assessment, measuring models' ability to track and interpret buyer intents across extended negotiations.

## 📋 Overview

BargainBench introduces a **Theory of Mind (ToM) grounded evaluation framework** for bargaining scenarios, featuring:

- **🎯 Intent-Action-Tool Hierarchy**: 65 structured intents across 17 high-level goals, 39 mid-level actions, and 65 atomic tools
- **📊 Large-Scale Benchmark**: 9,892 real marketplace products across 622 categories with 3,014 evaluation tasks
- **🔄 Three-Stage Pipeline**: Intent Factory → Problem Weaver → Evaluation Center
- **🌐 Cross-Domain Applicability**: Extensible to diplomatic, medical, and educational scenarios
- **📈 Turn-Level Evaluation**: Process-based assessment rather than outcome-only metrics

> **Notes**
> - The training data is under compliance review, so for now you will need to construct synthetic data yourself.
> - Reviewers: you may need to supply your own API key and update the file paths.
> - Sample tasks and grading reports can be found in the `grader/` folder.

## 🚀 Quick Start

### Prerequisites

```bash
# Requires Python >= 3.8
pip install -r requirement.txt
```

### Basic Setup

1. **Clone the repository, then enter its directory**
```bash
cd bargain-0818
```

2. **Configure API settings**
Create `config.yaml` with your LLM API credentials:
```yaml
API_KEY: "your-api-key-here"
BASE_URL: "https://api.openai.com/v1"  # or your preferred endpoint
MODEL_NAME: "gpt-4o-mini"  # or your target model
THREADING_NUM: 5
```
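
As a quick sanity check before launching a run, the settings above can be validated once they have been parsed into a dictionary. The sketch below is illustrative only: `validate_config` and `REQUIRED_KEYS` are hypothetical names that are not part of the repository, and it assumes the YAML file has already been loaded into a plain `dict` by whatever YAML parser you use.

```python
from typing import List
from urllib.parse import urlparse

# Keys taken from the config.yaml example above; the helper itself is hypothetical.
REQUIRED_KEYS = ("API_KEY", "BASE_URL", "MODEL_NAME", "THREADING_NUM")


def validate_config(cfg: dict) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    if problems:
        return problems

    if not cfg["API_KEY"] or cfg["API_KEY"] == "your-api-key-here":
        problems.append("API_KEY is still the placeholder value")

    parsed = urlparse(str(cfg["BASE_URL"]))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("BASE_URL is not an http(s) URL")

    threads = cfg["THREADING_NUM"]
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        problems.append("THREADING_NUM must be a positive integer")

    return problems


example = {
    "API_KEY": "your-api-key-here",
    "BASE_URL": "https://api.openai.com/v1",
    "MODEL_NAME": "gpt-4o-mini",
    "THREADING_NUM": 5,
}
print(validate_config(example))
```

With the unedited template values, the only problem reported is the placeholder API key.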

### 🔬 Reproducing Paper Results

#### Quick Validation (Recommended for Initial Testing)
```bash
# Evaluate on available sample tasks
python evaluate.py --sample_mode
```


#### Complete Reproduction
```bash
# Step 1: Generate complete evaluation dataset
python scripts2task.py  # Creates 3,014 evaluation tasks

# Step 2: Run full evaluation reproducing Table 2 results
python evaluate.py
```

## 🏗️ Framework Architecture

### 1. Intent Factory (`main.py`)
Extracts and refines buyer intents from marketplace dialogues using a multi-agent pipeline:
- **Extractor**: Mines candidate intent-action-tool triplets
- **Verifier**: Filters duplicates and validates consistency
- **Expert Guide**: Applies domain knowledge for semantic alignment
- **Maintainer**: Clusters and deduplicates for final intent space

```bash
# Generate intent space from raw dialogues
python main.py
```
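
The Verifier and Maintainer stages both involve removing duplicate candidates. As a rough illustration of that idea, the sketch below deduplicates intent-action-tool triplets by comparing normalized names. It is not the repository's implementation: the `Triplet` shape, the `normalize` and `deduplicate` helpers, and the pairing of names in the sample data are all assumptions, and plain string normalization stands in for the agent-based filtering and clustering described above.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (intent, action, tool); assumed shape


def normalize(name: str) -> str:
    """Case-fold and drop non-alphanumeric characters so near-identical names compare equal."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def deduplicate(candidates: Iterable[Triplet]) -> List[Triplet]:
    """Keep the first occurrence of each triplet, comparing normalized names."""
    seen = set()
    unique = []
    for triplet in candidates:
        key = tuple(normalize(part) for part in triplet)
        if key not in seen:
            seen.add(key)
            unique.append(triplet)
    return unique


# Illustrative candidates; the second differs from the first only in casing and punctuation.
candidates = [
    ("Price Negotiation", "Request Price Discount", "GetPriceAndNegotiationPolicy"),
    ("price negotiation", "request-price-discount", "getPriceAndNegotiationPolicy"),
    ("Logistics Discussion", "Inquire Shipping Options", "AskDeliveryTime"),
]
for triplet in deduplicate(candidates):
    print(triplet)
```

String normalization only catches surface-level duplicates; merging triplets that are semantically equivalent but worded differently would need a similarity model or an LLM judge.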

### 2. Problem Weaver (`scripts2task.py`, `scripts2task_2stage.py`)
Converts abstract intents into concrete multi-turn bargaining scenarios:
- Grounds intents in real product metadata
- Generates buyer queries with turn-level annotations
- Creates evaluation tasks with 20-candidate choice spaces

```bash
# Generate single-turn scenarios
python scripts2task.py

# Generate multi-turn scenarios
python scripts2task_2stage.py
```
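
One plausible way to assemble a 20-candidate choice space is to combine a task's gold labels with distractors sampled from the rest of the intent space. The sketch below shows that approach under stated assumptions: `build_choice_space` is a hypothetical helper, distractors are drawn uniformly at random, and the `Tool00`…`Tool64` names are placeholders for the 65 atomic tools. The actual sampling strategy in `scripts2task.py` may differ.

```python
import random
from typing import List, Sequence


def build_choice_space(gold: Sequence[str], intent_space: Sequence[str],
                       size: int = 20, seed: int = 0) -> List[str]:
    """Return `size` shuffled candidates containing every gold label plus random distractors."""
    gold_set = set(gold)
    if len(gold_set) > size:
        raise ValueError("more gold labels than candidate slots")
    pool = [name for name in intent_space if name not in gold_set]
    if len(pool) < size - len(gold_set):
        raise ValueError("intent space too small for the requested choice space")
    rng = random.Random(seed)  # seeded so the same task is generated on every run
    candidates = sorted(gold_set) + rng.sample(pool, size - len(gold_set))
    rng.shuffle(candidates)  # avoid leaking the gold labels through their position
    return candidates


# Placeholder names standing in for the 65 atomic tools.
intent_space = [f"Tool{i:02d}" for i in range(65)]
choices = build_choice_space(["Tool03", "Tool41"], intent_space)
print(len(choices), "Tool03" in choices, "Tool41" in choices)
```

Seeding the generator keeps regenerated tasks identical across runs, which matters when the dataset has to be rebuilt locally rather than downloaded.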

### 3. Evaluation Center (`evaluate.py`)
Conducts turn-level intent recognition evaluation:
- **Metrics**: Precision, Recall, F1, Failure Rate
- **Analysis**: Error categorization and performance breakdown
- **Visualization**: Intent hierarchy performance heatmaps

```bash
# Run comprehensive evaluation
python evaluate.py
```
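
The metric definitions are not spelled out here, so the following is a minimal sketch of one common formulation rather than the code in `evaluate.py`. It assumes each turn has a set of gold labels and a set of predicted labels, computes set-based precision, recall, and F1 for that turn, and treats a response that could not be parsed (represented as `None`) as a failure. The function names are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple


def turn_prf(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for a single turn."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def failure_rate(responses: Iterable[Optional[Set[str]]]) -> float:
    """Fraction of responses that are None, i.e. could not be parsed into a prediction."""
    responses = list(responses)
    return sum(r is None for r in responses) / len(responses) if responses else 0.0


# One of two predictions is correct, and one of two gold labels is recovered.
p, r, f1 = turn_prf({"AskDeliveryTime", "QueryProductAuthenticity"},
                    {"AskDeliveryTime", "GetPriceAndNegotiationPolicy"})
print(p, r, f1)

# One unparseable response out of four; an empty set is a valid (if wrong) prediction.
print(failure_rate([{"AskDeliveryTime"}, None, set(), {"AskDeliveryTime"}]))
```

Keeping failures separate from wrong answers makes it possible to tell a model that misreads intents from one that does not produce usable output at all.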

## 📊 Key Results

We evaluated 7 state-of-the-art LLMs; the table below reports results for five of them:

| Model | Turn 2 F1 | Turn 3 F1 | Turn 4 F1 | Avg F1 | Failure Rate |
|-------|-----------|-----------|-----------|--------|--------------|
| GPT-5-chat | 56.3 | 56.7 | 55.7 | **56.2** | **0.0%** |
| Qwen2.5-72B-Instruct | 49.8 | **55.0** | 48.8 | 51.2 | 1.0% |
| Qwen-32B | 47.3 | 47.8 | 46.5 | 47.2 | 1.3% |
| Gemini-1.5-Pro | 38.5 | 39.3 | 38.2 | 38.7 | 8.2% |
| DeepSeek-V3-671B | 26.3 | 26.8 | 26.1 | 26.4 | **53.4%** |

**Key Findings:**
- 📈 **Turn 3 optimal**: All five models in the table score their highest F1 at Turn 3, a moderate context length
- 🎯 **Reliability critical**: Failure rates vary dramatically (0% to 53.4%)
- 📊 **Hierarchy challenges**: 15-20% performance gap from abstract intents to specific tools
- 🏆 **GPT-5 leads**: GPT-5-chat has a 0.0% failure rate and the highest F1 at every turn (56.2 average)
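
The Avg F1 column appears to be the unweighted mean of the three turn-level scores, rounded to one decimal. That is an observation from the table rather than a documented definition, and the short check below recomputes it from the reported numbers along with each model's best turn.

```python
# Turn 2, Turn 3, Turn 4 F1 and the reported Avg F1, copied from the results table.
rows = {
    "GPT-5-chat": ((56.3, 56.7, 55.7), 56.2),
    "Qwen2.5-72B-Instruct": ((49.8, 55.0, 48.8), 51.2),
    "Qwen-32B": ((47.3, 47.8, 46.5), 47.2),
    "Gemini-1.5-Pro": ((38.5, 39.3, 38.2), 38.7),
    "DeepSeek-V3-671B": ((26.3, 26.8, 26.1), 26.4),
}

for model, (turns, reported) in rows.items():
    mean = round(sum(turns) / len(turns), 1)
    peak_turn = 2 + turns.index(max(turns))  # index 0 corresponds to Turn 2
    status = "matches" if mean == reported else "differs from"
    print(f"{model}: mean {mean} {status} reported {reported}, peak at Turn {peak_turn}")
```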

## 📁 Dataset and Intent Space

### Intent Hierarchy (`intent_space/intent66.json`)
```
17 High-Level Intents
├── Price Negotiation
├── Product Inquiry
├── Logistics Discussion
└── ...

39 Mid-Level Actions
├── Request Price Discount
├── Ask Product Details
├── Inquire Shipping Options
└── ...

65 Atomic Tools
├── QueryProductAuthenticity
├── GetPriceAndNegotiationPolicy
├── AskDeliveryTime
└── ...
```
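
The schema of `intent66.json` is not documented here, so the sketch below assumes one simple representation of the three levels: a nested mapping from intent to action to a list of tools. The names come from the excerpt above, but how they are paired is illustrative, and `level_counts` and `path_to_tool` are hypothetical helpers.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical nesting; the real intent_space/intent66.json schema may differ.
Hierarchy = Dict[str, Dict[str, List[str]]]

hierarchy: Hierarchy = {
    "Price Negotiation": {"Request Price Discount": ["GetPriceAndNegotiationPolicy"]},
    "Product Inquiry": {"Ask Product Details": ["QueryProductAuthenticity"]},
    "Logistics Discussion": {"Inquire Shipping Options": ["AskDeliveryTime"]},
}


def level_counts(tree: Hierarchy) -> Dict[str, int]:
    """Count the nodes at each of the three levels."""
    actions = [a for acts in tree.values() for a in acts]
    tools = [t for acts in tree.values() for ts in acts.values() for t in ts]
    return {"intents": len(tree), "actions": len(actions), "tools": len(tools)}


def path_to_tool(tree: Hierarchy, tool: str) -> Optional[Tuple[str, str, str]]:
    """Return the (intent, action, tool) path for a tool, or None if it is absent."""
    for intent, acts in tree.items():
        for action, tools in acts.items():
            if tool in tools:
                return intent, action, tool
    return None


print(level_counts(hierarchy))
print(path_to_tool(hierarchy, "AskDeliveryTime"))
```

If the full file follows a nesting like this, `level_counts` offers a quick way to confirm that the loaded hierarchy has the expected number of intents, actions, and tools.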

### Product Coverage
- **9,892** real marketplace listings
- **622** product categories (4-level hierarchy)
- **85** top-level categories
- **Average description**: 127 tokens

## 🗂️ File Structure

```
bargain-0818/
├── main.py                    # Intent Factory pipeline
├── scripts2task.py            # Problem Weaver (single-turn)
├── scripts2task_2stage.py     # Problem Weaver (multi-turn)
├── evaluate.py               # Evaluation Center
├── Agents.py                 # Multi-agent intent extraction
├── LLMClient.py              # Standardized LLM interface
├── API_Manager.py            # API handling utilities
├── config.yaml               # Main configuration
├── intent_space/             # Intent hierarchy and tools
│   ├── intent66.json         # Finalized 66-intent structure
│   ├── tree_app.py           # Interactive intent browser
│   └── merged_tree.json      # Intent tree visualization data
├── grader/                   # Evaluation metrics and analysis
├── action_space_analyze/     # Performance analysis tools
├── prompts/                  # Prompt templates
└── utils/                    # Utility functions
```

## 🔧 Configuration Options

### Model Evaluation Settings
All models are evaluated with the same standardized parameters:
```yaml
TEMPERATURE: 0.0   # Deterministic generation
MAX_TOKENS: 512    # Response length limit
TOP_P: 1.0         # Nucleus sampling disabled
CHOICE_SPACE: 20   # Candidate intents per task
```

### Computational Requirements
- **Full evaluation**: ~120 hours across 7 models
- **Storage**: ~5GB for complete dataset and results
- **API costs**: ~$2,600 (varies by model selection)
- **Memory**: 8GB+ RAM recommended

## 📈 Analysis and Visualization

### Performance Analysis
```bash
# Generate performance breakdowns
python -m action_space_analyze.analyze_results

# Interactive intent hierarchy browser
python intent_space/tree_app.py
```

### Error Analysis Categories
1. **Intent Confusion (34%)**: Semantic similarity issues
2. **Context Integration Failures (28%)**: Multi-turn memory gaps
3. **Domain Knowledge Gaps (21%)**: Product-specific understanding
4. **Ambiguity Resolution Issues (17%)**: Unclear buyer signals

## 🌐 Cross-Domain Extensions

BargainBench's framework generalizes beyond e-commerce:

- **🏛️ Diplomatic Negotiation**: Treaty discussions, international relations
- **🏥 Medical Consultation**: Doctor-patient interactions, diagnosis dialogue
- **📚 Educational Tutoring**: Student-teacher interactions, adaptive instruction

Adaptation requires:
1. Domain-specific dialogue corpora
2. Modified intent taxonomies
3. Updated prompt templates
4. Domain-adapted expert knowledge

## 🔒 Data Availability and Privacy

**Available Resources:**
- ✅ Complete source code and framework
- ✅ 66-intent hierarchical structure (`intent_space/intent66.json`)
- ✅ Evaluation pipeline and analysis tools
- ✅ Anonymized sample evaluation tasks
- ✅ Product category taxonomies and metadata schemas

**Data Considerations:**
Due to privacy and commercial compliance requirements, some datasets are still under internal review:
- Original marketplace dialogues contain sensitive user information
- Product listings require anonymization before public release
- The full evaluation dataset (3,014 tasks) will be released once the compliance review is complete

**Reproduction Guarantee:**
The complete generation pipeline enables researchers to:
- Reproduce the entire evaluation dataset using the provided intent space
- Generate new tasks across different domains
- Validate results immediately using the sample data


## 🤝 Contributing

We welcome contributions to extend BargainBench:

- **🔍 New domains**: Adapt the framework to diplomatic, medical, or educational scenarios
- **📊 Enhanced metrics**: Develop additional evaluation dimensions
- **🛠️ Tool improvements**: Optimize intent extraction and task generation
- **📖 Documentation**: Improve setup guides and examples
