# LLMBar Dataset Evaluation Results - Complete Collection

This folder contains the complete evaluation results for all five LLMBar datasets, produced by the MOE JUDGE system.

## **File Structure**

### **Complete Results Files**
- `natural_dataset_complete_results.json` - Natural dataset complete evaluation results (100 instances)
- `gptinst_dataset_complete_results.json` - GPTInst dataset complete evaluation results (92 instances)
- `manual_dataset_complete_results.json` - Manual dataset complete evaluation results (46 instances)
- `gptout_dataset_complete_results.json` - GPTOut dataset complete evaluation results (47 instances)
- `neighbor_dataset_complete_results.json` - Neighbor dataset complete evaluation results (134 instances)

### **Metrics Summary Files**
- `natural_dataset_metrics.json` - Natural dataset performance metrics
- `gptinst_dataset_metrics.json` - GPTInst dataset performance metrics
- `manual_dataset_metrics.json` - Manual dataset performance metrics
- `gptout_dataset_metrics.json` - GPTOut dataset performance metrics
- `neighbor_dataset_metrics.json` - Neighbor dataset performance metrics
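
The file names above follow a single `<dataset>_dataset_<kind>.json` pattern, so they can be enumerated and loaded programmatically. The following is a minimal sketch, not part of the evaluation code: the helper names `expected_files` and `load_json_if_present` are hypothetical, and because the JSON schema is not documented here, the script only reports the top-level shape of each metrics file.

```python
import json
from pathlib import Path

# Dataset prefixes taken from the file names listed in this README.
DATASETS = ["natural", "gptinst", "manual", "gptout", "neighbor"]


def expected_files(folder="."):
    """Map each dataset name to its (complete results, metrics) paths."""
    root = Path(folder)
    return {
        name: (
            root / f"{name}_dataset_complete_results.json",
            root / f"{name}_dataset_metrics.json",
        )
        for name in DATASETS
    }


def load_json_if_present(path):
    """Return the parsed JSON document, or None when the file is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    for name, (results_path, metrics_path) in expected_files().items():
        metrics = load_json_if_present(metrics_path)
        if metrics is None:
            print(f"{name}: {metrics_path.name} not found")
            continue
        # The schema is an unknown here, so only the top-level shape is shown.
        if isinstance(metrics, dict):
            shape = sorted(metrics)
        else:
            shape = type(metrics).__name__
        print(f"{name}: top-level = {shape}")
```

Run from inside this folder, the script prints one line per dataset, which is a quick way to confirm that all ten files are present before reading any field.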

## **Performance Summary**

| Dataset | Total Instances | Accuracy | Average Confidence | Avg Dimensions/Instance |
|---------|----------------|----------|-------------------|----------------------|
| **Natural** | 100 | **94.0%** | 94.3% | 6.07 |
| **GPTInst** | 92 | **90.0%** | 90.9% | ~7.05 |
| **Manual** | 46 | **78.3%** | 90.3% | 5.74 |
| **GPTOut** | 47 | **78.7%** | 85.5% | 6.09 |
| **Neighbor** | 134 | **78.4%** | 90.1% | 6.01 |
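
The per-dataset rows can be combined into an overall figure and a rough calibration check. The sketch below is illustrative arithmetic on the table values only: the rows are copied from the table, while the function names `weighted_accuracy` and `confidence_gaps` are hypothetical and do not come from the MOE JUDGE code. Because the table percentages are rounded, the results are approximate.

```python
# Rows copied from the Performance Summary table:
# (dataset, instances, accuracy %, average confidence %)
ROWS = [
    ("Natural", 100, 94.0, 94.3),
    ("GPTInst", 92, 90.0, 90.9),
    ("Manual", 46, 78.3, 90.3),
    ("GPTOut", 47, 78.7, 85.5),
    ("Neighbor", 134, 78.4, 90.1),
]


def weighted_accuracy(rows):
    """Accuracy over all instances, weighting each dataset by its size."""
    total = sum(n for _, n, _, _ in rows)
    return sum(n * acc for _, n, acc, _ in rows) / total


def confidence_gaps(rows):
    """Average confidence minus accuracy, in percentage points."""
    return {name: round(conf - acc, 1) for name, _, acc, conf in rows}


if __name__ == "__main__":
    print(f"Instances: {sum(n for _, n, _, _ in ROWS)}")
    print(f"Instance-weighted accuracy: {weighted_accuracy(ROWS):.1f}%")
    for name, gap in confidence_gaps(ROWS).items():
        print(f"{name}: confidence - accuracy = {gap:+.1f} points")
```

With the figures above, this gives an instance-weighted accuracy of about 84.7% over 419 instances. The confidence gap is smallest on Natural (+0.3 points) and largest on Manual (+12.0) and Neighbor (+11.7), so average confidence tracks accuracy far less closely on those two sets.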

## **Dataset Descriptions**

### **Natural Dataset**
- **Type**: Non-adversarial dataset (natural language instructions)
- **Difficulty**: Easiest of the five for the MOE JUDGE system to evaluate
- **Performance**: Highest accuracy (94.0%)
- **Characteristics**: Standard natural language tasks

### **GPTInst Dataset**
- **Type**: Adversarial dataset (GPT-generated instructions)
- **Difficulty**: Moderate
- **Performance**: Strong accuracy (90.0%)
- **Characteristics**: GPT-generated adversarial examples

### **Manual Dataset**
- **Type**: Adversarial dataset (manually crafted)
- **Difficulty**: Challenging
- **Performance**: Moderate accuracy (78.3%)
- **Characteristics**: Human-crafted adversarial examples

### **GPTOut Dataset**
- **Type**: Adversarial dataset (GPT output-based)
- **Difficulty**: Challenging
- **Performance**: Moderate accuracy (78.7%)
- **Characteristics**: Adversarial examples based on GPT outputs

### **Neighbor Dataset**
- **Type**: Adversarial dataset (neighbor-based)
- **Difficulty**: Challenging
- **Performance**: Moderate accuracy (78.4%)
- **Characteristics**: Neighbor-based adversarial examples
