# D-MOE-EVAL: A Dynamic Mixture-of-Experts Framework for Human-Aligned Nuanced Large Language Model Evaluation

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

## Overview

D-MOE-EVAL is a framework for evaluating Large Language Models (LLMs) that combines dynamic mixture-of-experts (MOE) candidate profiling with a jury debate system. It targets nuanced, human-aligned evaluation across multiple dimensions and scenarios, and includes evaluation scripts for the LLMBar and MD-Eval datasets.

## Research Context

This implementation accompanies the research paper: **"D-MOE-EVAL: A Dynamic Mixture-of-Experts Framework for Human-Aligned Nuanced Large Language Model Evaluation"**

The framework addresses the limitations of traditional LLM evaluation methods by:
- Implementing dynamic expert selection based on evaluation context
- Utilizing jury debate systems for consensus-based evaluation
- Supporting multi-dimensional assessment across various scenarios
- Providing comprehensive evaluation across LLMBar and MD-Eval datasets

## Repository Structure

```
├── src/                          # Source code
│   ├── candidate_profiling/       # Multi-model candidate profiling
│   │   ├── __init__.py
│   │   └── run_multi_model_profiling.py
│   ├── main_pipeline/            # Main evaluation pipelines
│   │   ├── __init__.py
│   │   ├── MOE_Pipeline_Standalone.py
│   │   └── integrated_jury_system.py
│   ├── jury/                     # Jury debate system
│   │   ├── __init__.py
│   │   └── jury_debate_system.py
│   ├── llmbar/                   # LLMBar evaluation scripts
│   │   ├── llmbar_gptout_evaluation.py
│   │   ├── llmbar_natural_evaluation.py
│   │   ├── llmbar_manual_evaluation.py
│   │   ├── llmbar_neighbor_evaluation.py
│   │   └── llmbar_flexible_evaluation.py
│   └── __init__.py
├── datasets/                     # Dataset storage
│   ├── llmbar/                   # LLMBar benchmark datasets
│   └── md_eval/                  # MD-Eval dataset
│       ├── metrics.yaml
│       └── seeds.json
├── results/                      # Evaluation results
│   └── candidate_profiling/      # Candidate profiling outputs
├── config/                       # Configuration files
│   ├── requirements.txt          # Python dependencies
│   ├── pyproject.toml            # Project configuration
│   ├── config.env.template       # Environment template
│   └── README.md
├── main.py                       # Entry point wrapper (prints examples/help)
├── requirements.txt              # Top-level dependencies (if used)
├── LICENSE                       # MIT License
└── README.md                     # This file
```

## System Architecture

### Core Components

1. **MOE Candidate Profiling System**
   - Dynamic expert selection based on evaluation context
   - Multi-dimensional assessment capabilities
   - Specialized models for different evaluation scenarios

2. **Integrated Jury Debate System**
   - Consensus-based evaluation approach
   - Multi-round debate mechanism
   - Validation mode for result verification

3. **Evaluation Scripts**
   - LLMBar dataset evaluation (GPTOut, Natural, Manual, Neighbor)
   - MD-Eval dataset evaluation
   - Flexible evaluation framework

4. **Multi-Model Profiling**
   - Batch processing capabilities
   - Checkpointing and resume functionality
   - Comprehensive result aggregation
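
To make the idea of dynamic expert selection concrete, here is a minimal sketch. It is not the repository's implementation: the `Expert` class, the `EXPERTS` pool, and the keyword-overlap scoring are all assumptions chosen for illustration, and the actual pipeline may select experts in an entirely different way (for example, by prompting an LLM).

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Expert:
    """Hypothetical expert description; not a class from this repository."""
    name: str
    keywords: FrozenSet[str]


# Illustrative expert pool (assumed, not the paper's actual experts).
EXPERTS = [
    Expert("factuality", frozenset({"fact", "date", "source", "who", "when"})),
    Expert("instruction_following", frozenset({"write", "list", "format", "exactly"})),
    Expert("code_quality", frozenset({"code", "function", "python", "bug"})),
]


def select_experts(instruction: str, experts: List[Expert], top_k: int = 2) -> List[str]:
    """Rank experts by keyword overlap with the instruction and keep the top_k."""
    tokens = {tok.strip(".,!?:;").lower() for tok in instruction.split()}
    scored = [(len(tokens & e.keywords), e.name) for e in experts]
    # Sort by descending overlap; ties break alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Experts with no overlap at all are never selected.
    return [name for score, name in scored[:top_k] if score > 0]
```

Calling `select_experts("Write a Python function that fixes the bug.", EXPERTS)` ranks the code-oriented expert first because it shares the most keywords with the instruction.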

### Key Features

- **Dynamic Expert Selection**: Automatically selects appropriate experts based on evaluation context
- **Jury Consensus**: Implements debate-based consensus for reliable evaluation
- **Multi-Dataset Support**: Compatible with LLMBar and MD-Eval datasets
- **Checkpointing**: Robust progress saving and resume capabilities
- **Parallel Processing**: Multi-worker support for efficient evaluation
- **Comprehensive Metrics**: Detailed evaluation metrics and analysis
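
The checkpointing and multi-worker features can be pictured with the following sketch. It assumes each dataset item carries an `id` field and that results are stored as one JSON object keyed by that id; both are assumptions for illustration, as are the function names. The repository's scripts may organize their checkpoints differently.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def load_checkpoint(path: str) -> Dict[str, dict]:
    """Return previously saved results keyed by item id, or {} if none exist."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_checkpoint(path: str, results: Dict[str, dict]) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(results, fh)
    os.replace(tmp, path)


def run_with_checkpoint(items: List[dict], evaluate: Callable[[dict], dict],
                        path: str, workers: int = 4) -> Dict[str, dict]:
    """Evaluate only the items missing from the checkpoint, using a thread pool."""
    results = load_checkpoint(path)
    pending = [it for it in items if str(it["id"]) not in results]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(evaluate, it): it for it in pending}
        for fut in as_completed(futures):
            results[str(futures[fut]["id"])] = fut.result()
            save_checkpoint(path, results)  # persist after every finished item
    return results
```

Threads suit this workload because each evaluation mostly waits on a remote API call. Running the function a second time with the same checkpoint path skips every item that already has a saved result.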

## Installation and Setup

### Prerequisites
- Python 3.8 or higher
- OpenAI API key
- Python packages listed in `config/requirements.txt`

### Installation Steps

1. **Get the repository (anonymous link):**
   ```
   https://anonymous.4open.science/r/D-MOE-Eval/
   ```

2. **Install dependencies:**
   ```bash
   pip install -r config/requirements.txt
   # Or for development:
   pip install -e "config/.[dev]"
   ```

3. **Set up environment variables:**
   ```bash
   cp config/config.env.template .env
   # Edit .env with your actual API keys
   ```
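
If you want to read the `.env` file from your own Python code, a standard-library-only sketch such as the one below may help. The variable name `OPENAI_API_KEY` used in the usage note is an assumption; check `config/config.env.template` for the names the project actually expects. The parser handles only simple `KEY=VALUE` lines, not the full dotenv syntax.

```python
import os
from typing import Dict


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path: str = ".env") -> Dict[str, str]:
    """Read the file and export its variables without overriding existing ones."""
    with open(path, "r", encoding="utf-8") as fh:
        values = parse_env_file(fh.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

After `load_env()`, a key defined in the file would be available as, for example, `os.environ["OPENAI_API_KEY"]` (hypothetical name), which avoids pasting the key into shell history.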

## Usage Examples

### MOE Candidate Profiling
```bash
python src/main_pipeline/MOE_Pipeline_Standalone.py \
    --api_key_1 YOUR_API_KEY \
    --dataset_path "datasets/llmbar/natural/dataset.json" \
    --output_path "results/moe_profiling_results.json" \
    --nums 100 \
    --workers 19
```

### Integrated Jury System
```bash
python src/main_pipeline/integrated_jury_system.py \
    --api_key_1 YOUR_API_KEY \
    --dataset_path "datasets/llmbar/natural/dataset.json" \
    --output_path "results/jury_results.json" \
    --nums 100 \
    --workers 19
```

### LLMBar Dataset Evaluation

**Natural Dataset:**
```bash
python src/llmbar/llmbar_natural_evaluation.py \
    --api_key_1 YOUR_API_KEY \
    --api_key_2 YOUR_SECOND_API_KEY \
    --dataset_path "datasets/llmbar/natural/dataset.json" \
    --output_path "results/natural_results.json" \
    --nums 100 \
    --workers 19
```

**GPTOut Dataset:**
```bash
python src/llmbar/llmbar_gptout_evaluation.py \
    --api_key_1 YOUR_API_KEY \
    --api_key_2 YOUR_SECOND_API_KEY \
    --dataset_path "datasets/llmbar/gptout/dataset.json" \
    --output_path "results/gptout_results.json" \
    --nums 100 \
    --workers 19
```

**Manual Dataset:**
```bash
python src/llmbar/llmbar_manual_evaluation.py \
    --api_key_1 YOUR_API_KEY \
    --api_key_2 YOUR_SECOND_API_KEY \
    --dataset_path "datasets/llmbar/manual/dataset.json" \
    --output_path "results/manual_results.json" \
    --nums 100 \
    --workers 19
```

**Neighbor Dataset:**
```bash
python src/llmbar/llmbar_neighbor_evaluation.py \
    --api_key_1 YOUR_API_KEY \
    --api_key_2 YOUR_SECOND_API_KEY \
    --dataset_path "datasets/llmbar/neighbor/dataset.json" \
    --output_path "results/neighbor_results.json" \
    --nums 100 \
    --workers 19
```
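
The four LLMBar commands above differ only in the subset name, so they can be generated rather than typed out. The helper below is a convenience sketch, not part of the repository; `build_command` and `SUBSETS` are names introduced here, while the script paths and flags are copied from the commands above.

```python
import sys
from typing import List

# Subset names as they appear in the script and dataset paths above.
SUBSETS = ["natural", "gptout", "manual", "neighbor"]


def build_command(subset: str, api_key_1: str, api_key_2: str,
                  nums: int = 100, workers: int = 19) -> List[str]:
    """Return the argv list for one LLMBar subset evaluation."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown LLMBar subset: {subset}")
    return [
        sys.executable, f"src/llmbar/llmbar_{subset}_evaluation.py",
        "--api_key_1", api_key_1,
        "--api_key_2", api_key_2,
        "--dataset_path", f"datasets/llmbar/{subset}/dataset.json",
        "--output_path", f"results/{subset}_results.json",
        "--nums", str(nums),
        "--workers", str(workers),
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to run the subsets one after another. Passing an argument list instead of a shell string avoids quoting problems with the API keys.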

### MD-Eval Dataset
```bash
python src/llmbar/llmbar_flexible_evaluation.py \
    --api_key_1 YOUR_API_KEY \
    --api_key_2 YOUR_SECOND_API_KEY \
    --dataset_path "datasets/md_eval/dataset.json" \
    --output_path "results/mdeval_results.json" \
    --nums 100 \
    --workers 19
```

### Multi-Model Candidate Profiling
```bash
python src/candidate_profiling/run_multi_model_profiling.py \
    --api_key YOUR_API_KEY \
    --dataset_path "datasets/llmbar/natural/dataset.json" \
    --output_path "results/multi_model_results.json" \
    --nums 100 \
    --workers 19
```

## Evaluation Modes

### Integrated System Evaluation

**Validation Mode:**
In validation mode, the jury debate system reviews the results produced by the MOE candidate profiling system and refines them, so that the final outcome reflects jury consensus.
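
One way to picture validation mode is a multi-round majority vote over the MOE verdict. The sketch below is an assumption-laden illustration, not the code in `integrated_jury_system.py`: jurors are plain Python callables standing in for LLM calls, verdicts are the strings `"A"` and `"B"`, and the stopping rule (unanimity or a round limit) is invented for the example.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A juror maps (moe_verdict, previous_round_votes) to its own vote, "A" or "B".
# In a real system each juror would be an LLM call; plain functions stand in here.
Juror = Callable[[str, Sequence[str]], str]


def jury_validate(moe_verdict: str, jurors: List[Juror], max_rounds: int = 3) -> dict:
    """Debate for up to max_rounds and report whether the jury upholds the MOE verdict."""
    if not jurors or max_rounds < 1:
        raise ValueError("need at least one juror and one round")
    votes: List[str] = []
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        # Every juror sees the MOE verdict and the votes from the previous round.
        votes = [juror(moe_verdict, tuple(votes)) for juror in jurors]
        if len(set(votes)) == 1:  # unanimous, so stop debating early
            break
    final = Counter(votes).most_common(1)[0][0]
    return {"final": final, "rounds": rounds_used, "upheld": final == moe_verdict}
```

An odd number of jurors avoids ties in the final majority vote; with an even jury, `Counter.most_common` would favor whichever verdict was voted first.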

## Output Format

Results are saved in JSON format and include the following groups of information:
- **Evaluation Metrics**: Accuracy, consistency, and reliability scores
- **Expert Analysis**: Individual expert evaluations and consensus
- **Jury Decisions**: Debate outcomes and final judgments
- **Metadata**: Timestamps, model information, and configuration details
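
A results file can be inspected with a few lines of Python. The key names used below (`results`, `metadata`, `final_judgment`, `label`) are hypothetical placeholders, since the exact schema is not documented here; open one of your own output files and adjust the names before relying on this sketch.

```python
import json
from typing import Dict, List


def accuracy(records: List[Dict]) -> float:
    """Fraction of records whose final judgment matches the gold label."""
    if not records:
        return 0.0
    # "final_judgment" and "label" are assumed field names, not a documented schema.
    correct = sum(1 for r in records if r["final_judgment"] == r["label"])
    return correct / len(records)


def summarize(path: str) -> Dict:
    """Load a results file and return a small summary dictionary."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    records = data.get("results", [])  # assumed top-level key
    return {
        "n": len(records),
        "accuracy": accuracy(records),
        "metadata": data.get("metadata", {}),  # assumed top-level key
    }
```

For instance, `summarize("results/natural_results.json")` would return the number of evaluated items and the agreement rate with the gold labels, provided the file follows the assumed layout.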

## Research Applications

This framework is designed for:
- **Academic Research**: Comprehensive LLM evaluation studies
- **Model Comparison**: Systematic comparison across different LLM architectures
- **Benchmark Development**: Creation of new evaluation benchmarks
- **Evaluation Methodology**: Research into evaluation techniques and metrics

## Citation

If you use this framework in your research, please cite:

```bibtex
@article{d-moe-eval2025,
  title={D-MOE-EVAL: A Dynamic Mixture-of-Experts Framework for Human-Aligned Nuanced Large Language Model Evaluation},
  author={Anonymous},
  journal={Under Review ICLR 2026 Conference},
  year={2025},
  url={https://anonymous.4open.science/r/D-MOE-Eval/}
}
```

## License and Contributing

This project is licensed under the MIT License. See the LICENSE file for details.

Contributions are welcome! Please feel free to submit a Pull Request.

## Contact

For questions and support, please contact:
- **Maintainers**: Anonymous
- **Email**: anonymous@example.com