# Hybrid Reinforcement Learning Framework

Implementation of the research paper "Mitigating Hallucinations in Large Language Models via Hybrid Reinforcement Learning".

## Overview

This framework implements a novel Hybrid Reinforcement Learning (HRL) approach that combines Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) to reduce hallucinations in large language models while maintaining text quality.

## Key Features

- **Adaptive Alpha Weighting**: Dynamic integration of human and AI feedback based on context complexity and training progress
- **Multiple Training Methods**: SFT, RLHF, RLAIF, Static Hybrid, and HRL implementations
- **Comprehensive Evaluation**: TruthfulQA and MMLU benchmarks with multiple metrics
- **Modular Architecture**: Clean separation of concerns for easy extension and modification
- **Extensive Visualization**: Training curves, performance comparisons, ablation studies

## Project Structure

```
├── config.py              # Configuration classes and default settings
├── datasets.py            # TruthfulQA and MMLU dataset loaders
├── models.py              # Base language model wrapper
├── feedback.py            # Human and AI feedback modules
├── training_methods.py    # All training method implementations
├── experiments.py         # Experiment runner and orchestration
├── visualization.py       # Plotting and reporting functions
├── utils.py               # Utility functions and logging
├── main.py                # Main execution script
├── requirements.txt       # Python dependencies
└── README.md              # This file
```

## Installation

1. Clone this repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Basic Execution

Run all experiments with default settings:
```bash
python main.py
```

### Command Line Options

```bash
python main.py --epochs 20                 # Set number of training epochs
python main.py --quick-test                # Run quick test (5 epochs)
python main.py --skip-ablation             # Skip ablation studies
python main.py --skip-domain               # Skip domain experiments
python main.py --skip-viz                  # Skip visualization generation
python main.py --load-results results.json # Load existing results
```

### Programmatic Usage

```python
from config import DEFAULT_MODEL_CONFIGS
from experiments import ExperimentRunner

# Initialize runner
runner = ExperimentRunner(DEFAULT_MODEL_CONFIGS)

# Run main experiments
results = runner.run_experiments(epochs=20)

# Run ablation study
ablation_results = runner.run_ablation_study(DEFAULT_MODEL_CONFIGS[0])

# Generate visualizations
from visualization import plot_performance_comparison
plot_performance_comparison(results, "distilgpt2")
```

## Methodology

### Framework Components

1. **Base Language Model**: Wrapper for various transformer architectures
2. **Human Feedback Module**: Simulates expert human evaluation with configurable expertise levels
3. **AI Feedback Module**: Automated evaluation with uncertainty estimation
4. **Reward Integrator**: Combines human and AI feedback using adaptive weighting
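
As a sketch, the blended reward can be computed as a convex combination of the two feedback signals. The function below is illustrative; `integrate_rewards` and its signature are assumptions, not the actual module API.

```python
def integrate_rewards(human_reward: float, ai_reward: float, alpha: float) -> float:
    """Blend human and AI rewards, weighting the human signal by alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * human_reward + (1.0 - alpha) * ai_reward
```

With `alpha = 0.5` this reduces to the static 50-50 hybrid; HRL instead supplies an adaptive alpha per context and training step.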

### Training Methods

- **SFT**: Standard supervised fine-tuning baseline
- **RLHF**: Pure human feedback reinforcement learning
- **RLAIF**: Pure AI feedback reinforcement learning  
- **Static Hybrid**: Fixed 50-50 weighting of human and AI feedback
- **HRL**: Adaptive hybrid weighting based on context and training progress

### Adaptive Alpha Computation

The core innovation is a dynamically weighted reward signal:

```
α(c,t) = initial_alpha × temporal_factor × complexity_factor × confidence_factor
```

Where:
- `temporal_factor`: Decreases human reliance over time
- `complexity_factor`: Increases human feedback for complex questions
- `confidence_factor`: Increases human oversight for uncertain outputs
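
A minimal sketch of this computation, assuming exponential temporal decay and linear complexity/confidence factors (the paper's exact parameterization may differ):

```python
import math

def compute_alpha(initial_alpha: float, step: int, complexity: float,
                  ai_confidence: float, decay: float = 0.01) -> float:
    """Illustrative α(c,t) = initial_alpha × temporal × complexity × confidence."""
    temporal_factor = math.exp(-decay * step)    # human reliance decays over time
    complexity_factor = 1.0 + complexity         # harder questions get more human weight
    confidence_factor = 2.0 - ai_confidence      # low AI confidence -> more human oversight
    alpha = initial_alpha * temporal_factor * complexity_factor * confidence_factor
    return min(max(alpha, 0.0), 1.0)             # clamp to a valid mixing weight
```

Early in training on a complex question with an uncertain AI evaluator, alpha saturates toward 1.0 (mostly human feedback); late in training on easy questions it decays toward the AI signal.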

### Evaluation Metrics

- **Factual Accuracy**: Proportion of factually correct outputs
- **Hallucination Rate**: Frequency of incorrect or unsupported statements
- **Coherence Score**: Text fluency and readability (1-5 scale)
- **Helpfulness**: Task-specific utility rating
- **Calibration Score**: Agreement between model confidence and correctness
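
For bookkeeping, these metrics can be carried in a small container; the field names below are illustrative and may not match the framework's internal names.

```python
from dataclasses import dataclass

@dataclass
class EvalMetrics:
    """Container mirroring the metrics above (hypothetical field names)."""
    factual_accuracy: float     # fraction of factually correct outputs
    hallucination_rate: float   # fraction of incorrect/unsupported statements
    coherence_score: float      # 1-5 fluency and readability rating
    helpfulness: float          # task-specific utility rating
    calibration_score: float    # confidence vs. correctness agreement

def hallucination_rate(flags) -> float:
    """Fraction of outputs flagged as hallucinated (True = hallucination)."""
    return sum(flags) / len(flags) if flags else 0.0
```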

## Expected Results

Based on the paper's findings, HRL should demonstrate:

- A 5% relative improvement in factual accuracy over the best baseline
- A 35% relative reduction in hallucination rate
- Maintained or improved text coherence
- Superior learning efficiency during training

## System Requirements

- Python 3.8+
- PyTorch 2.0+
- 8GB+ RAM (16GB+ recommended for larger models)
- Optional: CUDA-compatible GPU for acceleration

## Model Support

- **Primary**: LLaMA-2 7B/13B (requires HuggingFace access)
- **Fallback**: DistilGPT-2, GPT-2 variants
- **Extensible**: Any HuggingFace Transformers model
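
A fallback chain like the one above can be sketched as a loop over candidates. `select_model` and the injected loader are hypothetical; with Hugging Face Transformers, the loader could be `AutoModelForCausalLM.from_pretrained`.

```python
def select_model(candidates, try_load):
    """Return (name, model) for the first candidate that try_load accepts."""
    for name in candidates:
        try:
            return name, try_load(name)
        except Exception:
            continue  # e.g. a gated repo without access; try the next candidate
    raise RuntimeError("no candidate model could be loaded")
```

For example: `select_model(["meta-llama/Llama-2-7b-hf", "distilgpt2"], AutoModelForCausalLM.from_pretrained)`.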

## Customization

### Adding New Models

```python
from config import ModelConfig

custom_config = ModelConfig(
    model_name="your/model-name",
    tokenizer_name="your/tokenizer-name",
    max_length=512
)
```

### Custom Training Methods

```python
from training_methods import TrainingMethod

class CustomMethod(TrainingMethod):
    def train_step(self, model, batch):
        # Your training logic; return a dict of step metrics
        metrics = {"loss": 0.0}
        return metrics

    def get_name(self):
        return "CustomMethod"
```

### Custom Datasets

```python
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data_path):
        # Load your data into self.data
        self.data = []

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a sample with a 'question' key
        return {'question': self.data[idx]}
```

## Output Files

Results are saved to the `results/` directory:

- `main_experiments.json`: Complete training histories
- `ablation_studies.json`: Alpha parameter sweep results  
- `domain_experiments.json`: Domain-specific performance
- `logs/`: Timestamped execution logs
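
The saved JSON files can be reloaded for further analysis (matching the `--load-results` flag); a minimal sketch:

```python
import json
from pathlib import Path

def load_results(path):
    """Load a saved results file, e.g. results/main_experiments.json."""
    return json.loads(Path(path).read_text())
```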

## Limitations

- Feedback modules use simulated rather than real human annotations
- Metrics are heuristic-based rather than trained evaluators
- Limited to English language evaluation
- Computational requirements scale with model size

## Contributing

1. Follow the modular architecture patterns
2. Add comprehensive docstrings
3. Include unit tests for new functionality
4. Update configuration classes as needed

## Citation

If you use this implementation, please cite the original paper:

```bibtex
@inproceedings{hrl2026,
  title={Mitigating Hallucinations in Large Language Models via Hybrid Reinforcement Learning},
  author={Anonymous},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```

## License

This implementation is provided for research purposes. Please ensure compliance with model licenses when using LLaMA or other restricted models.