# MAJ-Eval: Multi-Agent Debate Framework for Evaluation

MAJ-Eval is a multi-agent debate framework for evaluation tasks. Multiple AI agents, each with a distinct persona and perspective, engage in structured debates, with the goal of producing more nuanced and well-rounded evaluations than a single-agent approach.

## Overview

This framework implements a two-stage evaluation process:
1. **Step 1: Stakeholder Persona Creation**: Construct stakeholder personas based on provided research papers.
2. **Step 2: Multi-Agent-as-Judge Debate Evaluation**: Stakeholder agents engage in structured debates, considering each other's viewpoints and potentially refining their evaluations.
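
To make the refinement idea in Step 2 concrete, here is a toy sketch of a debate loop. It is not the code in `evaluation_debate.py`: the `StakeholderAgent` class, the `openness` parameter, and the numeric update rule are all hypothetical stand-ins for LLM-backed agents that exchange written arguments.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class StakeholderAgent:
    """Hypothetical stand-in for an LLM-backed stakeholder judge."""
    persona: str
    score: float            # initial, independent rating
    openness: float = 0.5   # 0 = never revises, 1 = adopts the peers' mean
    history: List[float] = field(default_factory=list)

    def revise(self, peer_scores: List[float]) -> None:
        """Move the score toward the peers' mean, scaled by openness."""
        self.history.append(self.score)
        if peer_scores:
            self.score += self.openness * (mean(peer_scores) - self.score)


def run_debate(agents: List[StakeholderAgent], rounds: int = 2) -> Dict[str, float]:
    """Run a fixed number of rounds and return each persona's final score."""
    for _ in range(rounds):
        # Snapshot first so every agent reacts to the same round's scores.
        snapshot = [agent.score for agent in agents]
        for i, agent in enumerate(agents):
            peers = snapshot[:i] + snapshot[i + 1:]
            agent.revise(peers)
    return {agent.persona: round(agent.score, 2) for agent in agents}
```

The snapshot keeps each round order-independent: no agent sees a score that was already revised in the same round.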


## Prerequisites

- Python 3.8+
- Credentials for the model provider you plan to use: an AWS account (for Claude via AWS Bedrock), an OpenAI API key (for OpenAI models), or the appropriate API credentials for another model

## Installation

1. **Clone the repository**
   ```bash
   git clone <repository-url>
   cd MAJ-Eval
   ```

2. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables**
   ```bash
   # For OpenAI
   export OPENAI_API_KEY="your-openai-api-key"
   
   # For AWS Bedrock (Claude)
   export AWS_ACCESS_KEY_ID="your-aws-access-key"
   export AWS_SECRET_ACCESS_KEY="your-aws-secret-key"
   export AWS_DEFAULT_REGION="us-east-1"  # or your preferred region
   ```
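
A quick way to confirm the variables are visible to Python before launching a run is a small check like the one below. This is a sketch, not part of the repository: the variable names come from the step above, while the provider keys (`"openai"`, `"bedrock"`) and the function name are assumptions made for the example.

```python
import os
from typing import Dict, List, Mapping

# Variable names come from the setup step above; grouping them under the
# keys "openai" and "bedrock" is an assumption made for this example.
REQUIRED_VARS: Dict[str, List[str]] = {
    "openai": ["OPENAI_API_KEY"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"],
}


def missing_credentials(provider: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty for a provider."""
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_VARS:
        missing = missing_credentials(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

Only the provider you actually use needs to report `ok`.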

## Project Structure

```
MAJ-Eval/
├── evaluation_debate.py         # Main debate framework
├── persona_creation.py          # Persona generation from research papers
├── claude_api.py                # AWS Bedrock Claude integration
├── requirements.txt             # Python dependencies
├── MAJ with Customized Model/   # Custom model implementations with Claude and Qwen
│   ├── persona_creation_Qwen.py
│   ├── persona_creation_Claude.py
│   ├── evaluation_debate_Claude_QAG.py
│   └── evaluation_debate_Qwen_QAG.py
├── ChildEduPapers/              # Research papers for persona creation
├── MedPapers/                   # Medical domain papers
├── Expert Annotation Data/      # Human evaluation data
│   ├── StorySparkQA Human Score.csv
│   └── MSLR-COCHRANE Human Score.csv
└── NewOpinions2Personas/        # Generated personas and opinions
```

## Configuration

### Model Configuration

The framework supports multiple model providers, each described by a `ModelConfig`:

```python
from evaluation_debate import ModelConfig, ModelProvider

# OpenAI Configuration
openai_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4",
    api_key="your-api-key",
    model_kwargs={"temperature": 0.7}
)

# Claude Configuration
claude_config = ModelConfig(
    provider=ModelProvider.CLAUDE,
    model_name="your claude api info"  # placeholder: replace with your own Claude model details
)
```

## Run MAJ-Eval

### Step 1: Stakeholder Persona Creation

```bash
python persona_creation.py
```

### Step 2: Multi-Agent-as-Judge Evaluation Debate

```bash
python evaluation_debate.py
```
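
If you prefer to run both steps with one command, a small wrapper along these lines may help. It is a sketch under the assumption that both scripts run without arguments from the repository root; `run_steps` is a hypothetical helper, not something shipped with MAJ-Eval.

```python
import subprocess
import sys
from typing import List, Sequence

# Script names are taken from the two steps above; running them without
# arguments from the repository root is an assumption.
PIPELINE: List[List[str]] = [
    [sys.executable, "persona_creation.py"],
    [sys.executable, "evaluation_debate.py"],
]


def run_steps(steps: Sequence[List[str]]) -> int:
    """Run each command in order and return the first non-zero exit code, or 0."""
    for cmd in steps:
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Stop early: Step 2 depends on the personas produced by Step 1.
            return result.returncode
    return 0
```

Calling `sys.exit(run_steps(PIPELINE))` from the repository root runs persona creation first and starts the debate only if that step succeeds.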

## Important Notes

### Path Configuration
**IMPORTANT**: Before running either step, update the following paths to match your local setup:
- Research paper directories (`ChildEduPapers/`, `MedPapers/`)
- Output directories for logs and results
- CSV file paths for evaluation data
- Model-specific configuration files
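
One way to keep these paths in a single place is a small settings object that is checked before a run starts. The sketch below is illustrative only: `ProjectPaths` and the `outputs` directory name are hypothetical, and the default input locations are copied from the project structure above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class ProjectPaths:
    """Hypothetical container for the locations listed above."""
    root: Path
    papers_dir: str = "ChildEduPapers"  # or "MedPapers"
    human_scores_csv: str = "Expert Annotation Data/StorySparkQA Human Score.csv"
    output_dir: str = "outputs"         # assumed name, not taken from the repository

    def resolve(self, relative: str) -> Path:
        """Build an absolute-style path under the project root."""
        return self.root / relative

    def missing_inputs(self) -> List[Path]:
        """Return the required input paths that do not exist."""
        required = [self.resolve(self.papers_dir), self.resolve(self.human_scores_csv)]
        return [path for path in required if not path.exists()]

    def ensure_output_dir(self) -> Path:
        """Create the output directory if needed and return it."""
        out = self.resolve(self.output_dir)
        out.mkdir(parents=True, exist_ok=True)
        return out
```

Checking `missing_inputs()` before making any API calls surfaces a wrong path without spending tokens.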

### API Rate Limits
- Rate-limit API calls so that they stay within your provider's quotas
- Consider using an async semaphore (for example `asyncio.Semaphore`) to cap the number of concurrent requests
- Monitor API usage and costs
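
The semaphore suggestion can be sketched as follows. This is a minimal example, not the framework's own implementation: `gather_limited` and `fake_model_call` are hypothetical names, and the placeholder coroutine stands in for a real API request.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_limited(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrent: int = 5,
) -> List[R]:
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: T) -> R:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_model_call(prompt: str) -> str:
    """Placeholder for a real API request (hypothetical)."""
    await asyncio.sleep(0.01)
    return prompt.upper()
```

For example, `asyncio.run(gather_limited(prompts, fake_model_call, max_concurrent=2))` processes a list of prompts two at a time. Note that a semaphore bounds concurrency, not requests per second, so a provider with a strict per-minute quota may still need retries with backoff.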

