# Multi-Agent System Hallucination Quest Game

## Overview

This project implements a competition between two Multi-Agent Systems (MAS), each referred to as a Q-Agent, that generate and refine summaries of news articles. The goal is to produce summaries with high factual consistency (low hallucination) while using resources such as tokens, API calls, and time efficiently.

Each Q-Agent consists of three agent types:
1. **Policy Agent**: Makes strategic decisions about whether to continue summarizing new articles, review existing summaries, or end processing.
2. **Summary Agent**: Generates summaries from original news articles.
3. **Review Agent**: Refines summaries that have high hallucination rates.

## Setup

### Environment Setup

You can set up the required environment using conda:

```bash
# Create conda environment
conda env create -f environment.yml

# Activate the environment
conda activate mas-hq
```

Alternatively, you can install the dependencies manually:

```bash
pip install datasets openai tqdm transformers
```

### Configuration

Before running, edit `main.py` to configure your API credentials:

```python
# Initialize the OpenAI client
client = OpenAI(
    api_key='your_api_key',
    base_url='your_base_url'  # For OpenAI API, use 'https://api.openai.com/v1'
)
```
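If you would rather not hardcode credentials in `main.py`, one option is to read them from environment variables. The sketch below is an assumption, not part of the project: the helper `load_credentials` and the variable names `OPENAI_API_KEY` and `OPENAI_BASE_URL` are chosen here for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Return (api_key, base_url) read from environment variables.

    Hypothetical helper: the variable names are assumptions for this sketch.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running main.py")
    # Fall back to the OpenAI endpoint mentioned above when no base URL is set.
    base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return api_key, base_url


# Possible usage in main.py:
#   api_key, base_url = load_credentials()
#   client = OpenAI(api_key=api_key, base_url=base_url)
```

This keeps keys out of version control; the rest of `main.py` would stay unchanged.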

## Usage

Run the main script to start the competition:

```bash
python main.py
```

### Configurable Parameters

You can modify the following parameters in `main.py`:

- `MODEL`: The language model to use (default: 'gpt-4o-mini')
- `test_num`: Number of articles to process (default: 1000)
- `review_time`: Maximum number of review attempts per article (default: 3)
- `HALLUCINATION_THRESHOLD`: Score threshold that triggers review recommendations (default: 0.85)
- `specific_ids`: List of article IDs to include in the evaluation
- `MAX_WORKERS`: Number of concurrent API calls (default: 100)
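As a rough illustration of how `review_time` and `HALLUCINATION_THRESHOLD` could interact, the sketch below caps the number of refinement attempts and stops early once a summary no longer needs review. It is not taken from `main.py`: the function `review_until_ok` and its callable arguments are hypothetical, and because this README does not state which side of the threshold triggers a review, the comparison is passed in as `needs_review`.

```python
def review_until_ok(summary, score_fn, refine_fn, needs_review, review_time=3):
    """Refine a summary at most `review_time` times.

    score_fn, refine_fn, and needs_review are hypothetical stand-ins for
    the scorer, the Review Agent, and the threshold check.
    """
    attempts = 0
    score = score_fn(summary)
    while attempts < review_time and needs_review(score):
        summary = refine_fn(summary)
        score = score_fn(summary)
        attempts += 1
    return summary, score, attempts
```

In the actual system the Policy Agent decides whether to review, so the threshold acts as a recommendation rather than a hard rule.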

### Models

Uncomment one of the alternative model options to use it:

```python
MODEL = 'gpt-4o-mini'
# MODEL = 'qwen-max'
# MODEL = 'deepseek-v3-250324'
# MODEL = 'gemini-2.0-flash'
# MODEL = 'grok-3-beta'
# MODEL = 'step-2-mini'
# MODEL = 'glm-4v-flash'
```

## Output

The system generates the following output files:

1. `mas1_[model]_[count]_thre_[threshold]_review_[review_time].json`: Results from the first MAS (forward order)
2. `mas2_[model]_[count]_thre_[threshold]_review_[review_time].json`: Results from the second MAS (reverse order)
3. `log_[model]_[count]_thre_[threshold]_review_[review_time].txt`: Detailed log of the competition
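
The naming pattern above can be sketched as a small helper. `output_paths` is a hypothetical function written for this README; the exact formatting `main.py` applies to the threshold or model name may differ.

```python
def output_paths(model, count, threshold, review_time):
    """Build the three output file names from the run parameters.

    Hypothetical helper that mirrors the pattern described above.
    """
    suffix = f"{model}_{count}_thre_{threshold}_review_{review_time}"
    return {
        "mas1": f"mas1_{suffix}.json",
        "mas2": f"mas2_{suffix}.json",
        "log": f"log_{suffix}.txt",
    }


paths = output_paths("gpt-4o-mini", 1000, 0.85, 3)
print(paths["mas1"])  # mas1_gpt-4o-mini_1000_thre_0.85_review_3.json
```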

### Competition Results

At the end of the run, the system will output:
- Comparison of hallucination scores between the two MAS
- Token usage statistics
- API call counts
- Time elapsed
- Review statistics
- The winner based on a combined score of accuracy and efficiency

## System Operation

1. Two MAS run in parallel, processing articles in different orders (forward vs. reverse)
2. Each MAS makes strategic decisions via its policy agent
3. The summary agent generates initial summaries
4. The review agent refines summaries with high hallucination rates
5. The systems compete to achieve the best balance of factual accuracy and resource efficiency
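
The first step, two MAS running in parallel over opposite article orders, can be sketched with a thread pool. This is a minimal illustration under assumptions: `run_mas` is a placeholder that only records the processing order, and `run_competition` is a hypothetical name, not a function from `main.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_mas(name, article_ids):
    """Placeholder for one MAS run; it only records the processing order.

    A real run would call the policy, summary, and review agents here.
    """
    processed = []
    for article_id in article_ids:
        processed.append(article_id)
    return name, processed


def run_competition(article_ids):
    """Run two placeholder MAS concurrently: forward and reverse order."""
    forward = list(article_ids)
    backward = list(reversed(forward))
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(run_mas, "mas1", forward)
        second = pool.submit(run_mas, "mas2", backward)
        return dict([first.result(), second.result()])
```

Threads suit this workload because each MAS spends most of its time waiting on API responses.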

