# Reproducibility Statement

## Overview

To ensure full reproducibility of our results, we provide comprehensive source code, datasets, configuration files, and documentation. Our codebase implements the complete BargainBench framework including the Intent Factory, Problem Weaver, and Evaluation Center components described in the paper.

## Code and Data Availability

### Complete Source Code Package
The supplementary materials include the entire codebase with the following key components:

- **Intent Factory Pipeline** (`main.py`, `Agents.py`): Multi-agent system for intent space extraction and refinement
- **Problem Weaver** (`scripts2task.py`, `scripts2task_2stage.py`): Automated dialogue generation and task synthesis
- **Evaluation Framework** (`evaluate.py`): Comprehensive model evaluation with turn-level metrics
- **LLM Integration** (`LLMClient.py`, `API_Manager.py`): Standardized interface for multiple language models
- **Analysis Tools** (`intent_space/`, `grader/`, `action_space_analyze/`): Visualization and analysis utilities

### Generated Intent Space
We provide the complete **66 intents** (`intent_space/intent66.json`) tailored to second-hand e-commerce scenarios, organized in our hierarchical intent-action-tool structure:
- 17 high-level strategic intents
- 39 mid-level negotiation actions
- 65 atomic executable tools

### Evaluation Datasets
Due to privacy and commercial considerations, we provide:
- **Sample Evaluation Tasks**: A representative subset of processed, anonymized evaluation tasks from our complete dataset of 3,014 synthetic bargaining scenarios
- **Product Information**: Publicly available marketplace item categories and anonymized product descriptions
- **Cross-Domain Validation Sets**: Sample diplomatic, medical, and educational dialogue scenarios (limited subset for demonstration)
- **Complete Generation Pipeline**: Full code to regenerate the entire 3,014-task dataset using the provided intent space and product categories

## Reproduction Instructions

### System Requirements
- Python 3.8+
- Required packages: `openai`, `pandas`, `pyodps` (see `requirement.txt`)
- Computational resources: ~120 hours for full evaluation across 7 models
- Storage: ~5GB for complete dataset and results

### Quick Start Reproduction

1. **Environment Setup**
   ```bash
   git clone [repository-url]
   cd bargain-0818
   pip install -r requirement.txt
   ```

2. **API Configuration**
   Create `config.yaml` with your API credentials:
   ```yaml
   API_KEY: "your-api-key-here"
   BASE_URL: "https://api.openai.com/v1"  # or your preferred endpoint
   MODEL_NAME: "gpt-4o-mini"  # or your target model
   ```

3. **Evaluation Options**
   ```bash
   # Option A: Evaluate on provided sample tasks (quick validation)
   python evaluate.py --sample_mode

   # Option B: Generate full 3,014-task dataset and evaluate (complete reproduction)
   python scripts2task.py  # Generate dialogues using provided intents
   python evaluate.py      # Full evaluation reproducing Table 2 results
   ```

### Detailed Reproduction Steps

#### Phase 1: Intent Space Generation (Optional)
If you wish to regenerate the intent space from scratch:
```bash
# Configure intent factory
cp config.yaml api_config.yaml
# Edit DATA_PATH and SAVE_PATH in api_config.yaml

# Run multi-agent intent extraction
python main.py
```

#### Phase 2: Dialogue Generation (Optional)
To generate new evaluation tasks:
```bash
# Configure dialogue generation
# Edit PRODUCT_FILE_PATH and ACTION_FILE_PATH in config.yaml

# Generate single-turn dialogues
python scripts2task.py

# Generate multi-turn dialogues
python scripts2task_2stage.py
```

#### Phase 3: Model Evaluation (Core Results)
To reproduce our main experimental results:

**For Sample Evaluation (Quick Validation):**
```bash
# Uses provided anonymized sample tasks
python evaluate.py --sample_mode
```

**For Complete Reproduction:**
```bash
# Generate full dataset first
python scripts2task.py  # Creates 3,014 evaluation tasks

# Set evaluation parameters in config.yaml:
# - ACTION_SPACE_PATH: path to intent66.json
# - DATASET_PATH: path to generated evaluation dialogues
# - RESULT_SAVE_PATH: output directory

# Run evaluation on all models
python evaluate.py
```

### Expected Outputs

Running the evaluation pipeline will generate:
- **Performance Metrics**: Precision, recall, F1, and failure rates by model and turn
- **Statistical Analysis**: Bootstrap confidence intervals and significance tests
- **Error Analysis**: Categorized failure modes and intent-specific performance
- **Visualizations**: Performance heatmaps and category breakdowns

**Note**: Sample evaluation provides qualitative validation of the framework, while full reproduction requires dataset generation and yields the exact quantitative results reported in our paper.

## Configuration Details

### Model Evaluation Settings
All models are evaluated with standardized parameters for fair comparison:
- Temperature: 0.0 (deterministic)
- Max tokens: 512
- Top-p: 1.0 (disabled)
- Choice space: 20 candidate intents per task

### Cross-Domain Adaptation
To apply our framework to new domains:
1. Prepare domain-specific dialogue corpora
2. Modify intent taxonomies in `intent_space/`
3. Update prompt templates in configuration
4. Run intent factory with domain-adapted expert knowledge

## Verification and Validation

### Reproducibility Checks
We provide several validation mechanisms:

1. **Deterministic Results**: Fixed random seeds ensure identical outputs
2. **Checksum Validation**: MD5 hashes for all dataset files
3. **Statistical Verification**: Bootstrap sampling with 1000 iterations
4. **Cross-Platform Testing**: Validated on Linux, macOS, and Windows

### Expected Runtime and Costs
- **Intent Generation**: ~48 hours, ~$500 in API costs
- **Dialogue Synthesis**: ~24 hours, ~$300 in API costs
- **Model Evaluation**: ~120 hours, ~$2,600 in API costs (varies by model)
- **Analysis and Visualization**: ~2 hours, local computation

## Additional Resources

### Documentation
- `README.md`: Setup and quick start guide
- `bargain_experiment_plan.md`: Detailed experimental methodology
- `grader/grader_explain.md`: Evaluation metrics explanation
- Inline code documentation with docstrings

### Analysis Tools
- **Interactive Visualizations**: Intent hierarchy browser (`intent_space/tree_app.py`)
- **Performance Analysis**: Category-wise breakdown tools (`action_space_analyze/`)
- **Error Inspection**: Manual error analysis utilities (`grader/`)

### Support and Community
- **Issue Tracking**: GitHub issues for bug reports and questions
- **Documentation**: Comprehensive API documentation and examples
- **Community Forum**: Discussion space for extensions and applications

## Ethical Considerations

Our reproducible research package includes:
- **Bias Detection Tools**: Analysis scripts for demographic and category-based performance gaps
- **Fairness Metrics**: Evaluation across different user populations
- **Privacy Protection**: All personal information removed from datasets
- **Usage Guidelines**: Best practices for responsible deployment

## Future Extensions

The modular architecture supports easy extension to:
- New dialogue domains (legal, financial, educational)
- Additional languages and cultural contexts
- Real-time interactive evaluation scenarios
- Integration with human-in-the-loop validation

This comprehensive reproducibility package ensures that researchers can not only replicate our exact results but also extend our framework to new domains and research questions, fostering continued innovation in AI social intelligence evaluation.