# IRT Benchmark Project

This project contains code for running Item Response Theory (IRT) experiments on benchmark data and analyzing model performance across multiple benchmarks.

## Project Structure

- `model_training/` - Code for training IRT models
- `metric_calculation/` - Code for calculating performance metrics
- `visualization/` - Code for generating visualizations
- `results_processing/` - Code for processing and summarizing results
- `utils/` - Utility scripts and helper functions
- `docs/` - Documentation and reports

## Core Components

### Model Training
- `run_improved_mix_benchmark.py` - Main script for running improved mixed benchmark IRT experiments
- `run_benchmark_single.py` - Script for running single benchmark experiments
- `modeling_single.py` - Single benchmark IRT modeling implementation
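
For orientation, the quantity an IRT model estimates can be sketched as follows. This is a minimal illustration assuming the common two-parameter logistic (2PL) form; this README does not state which IRT variant `modeling_single.py` implements, and the function name, variable names, and numeric values below are hypothetical.

```python
import math


def prob_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct) = sigmoid(a * (theta - b)).

    theta: ability of the evaluated model
    a:     item discrimination
    b:     item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Illustrative values only (not taken from any experiment in this project).
abilities = {"model_a": 1.0, "model_b": -0.5}  # hypothetical abilities
item = {"a": 1.2, "b": 0.0}                    # hypothetical item parameters

for name, theta in abilities.items():
    p = prob_correct(theta, item["a"], item["b"])
    print(f"{name}: P(correct) = {p:.3f}")
```

When ability equals item difficulty, the predicted probability is 0.5, and it increases with ability for any positive discrimination.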

### Metric Calculation
- Various `calculate_*.py` scripts for computing different performance metrics
- Comparison scripts for evaluating model performance

### Visualization
- Various `visualize_*.py` scripts for generating plots and charts

### Results Processing
- Scripts for summarizing and processing experiment results

## Usage

1. Run the model training scripts (`run_improved_mix_benchmark.py` or `run_benchmark_single.py`) to generate IRT model results
2. Use the `calculate_*.py` scripts to compute performance metrics
3. Generate visualizations with the `visualize_*.py` scripts
4. Process and summarize results with the scripts in `results_processing/`
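
Because each step is spread over several scripts, it can help to list what is available per stage before running anything. The sketch below only discovers scripts and does not execute them. It assumes the scripts live in the directories named under Project Structure and follow the filename patterns shown; the `STAGES` mapping and the `discover_scripts` helper are hypothetical and not part of this project.

```python
from pathlib import Path

# Assumed mapping from pipeline stage (directory) to script filename pattern.
STAGES = [
    ("model_training", "run_*.py"),
    ("metric_calculation", "calculate_*.py"),
    ("visualization", "visualize_*.py"),
    ("results_processing", "*.py"),
]


def discover_scripts(root):
    """Return {stage: sorted script names} for each stage, in pipeline order."""
    root = Path(root)
    found = {}
    for stage, pattern in STAGES:
        stage_dir = root / stage
        # A missing directory simply yields an empty list.
        names = [p.name for p in stage_dir.glob(pattern)] if stage_dir.is_dir() else []
        found[stage] = sorted(names)
    return found


if __name__ == "__main__":
    for stage, scripts in discover_scripts(".").items():
        print(f"{stage}: {', '.join(scripts) or '(none found)'}")
```

Run from the repository root, this prints one line per stage in the order of the steps above.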

## Requirements

- Python 3.x
- PyMC
- ArviZ
- NumPy
- Pandas
- Matplotlib
- Scikit-learn
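
A quick way to confirm the environment is ready is to check that each package is importable. This is a minimal sketch: the `missing_modules` helper is hypothetical, and it assumes PyMC version 4 or later, which imports as `pymc` (older releases import as `pymc3`).

```python
from importlib.util import find_spec

# Import names for the packages listed above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["pymc", "arviz", "numpy", "pandas", "matplotlib", "sklearn"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names from `modules` that cannot be found for import."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```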