# Schema Induction Test Inference Pipeline

This directory contains the core components for evaluating the schema induction pipeline on test data.

## Core Files

### 🚀 Main Scripts
- **`build_corpus.py`** - Core pipeline script that processes test data through open coding, code replacement, and hierarchical retrieval
- **`comprehensive_evaluation.py`** - Runs both reusability and descriptive fitness evaluations
- **`run_evaluation.py`** - Simple interface for running evaluations

### 📊 Evaluation Components
- **`prompts.py`** - Contains prompts for open coding and code replacement
- **`reusability_eval.py`** - Calculates the reusability metric (number of unique codes reused from the training corpus)
- **`descriptive_fitness_eval.py`** - Evaluates how well assigned codes describe the content (1-10 scale)

## Usage

### Quick Evaluation
```bash
python run_evaluation.py \
  --test_data ../../evaluation/distribution_metric/input/aliabdaal_test.csv \
  --question "How Ali mixes teaching with entertainment?" \
  --max_datapoints 10
```

### Comprehensive Evaluation
```bash
python comprehensive_evaluation.py \
  --test_data ../../evaluation/distribution_metric/input/aliabdaal_test.csv \
  --question "How Ali mixes teaching with entertainment?" \
  --train_corpus ../../main_pipeline/temp_files/iteration_02/conflict_detection/unique_codes.parquet \
  --hierarchical_tree ../../main_pipeline/temp_files/iteration_02/hierarchical_tree/inference_tree.json \
  --output_dir ../../evaluation/distribution_metric/results \
  --max_datapoints 10
```

## Output

Results are saved in `../../evaluation/distribution_metric/results/` (the `--output_dir` used in the comprehensive example above) as timestamped files:
- `comprehensive_evaluation_[timestamp].json` - Combined results
- `comprehensive_evaluation_[timestamp]_reusability.json` - Reusability details
- `comprehensive_evaluation_[timestamp]_fitness.json` - Descriptive fitness details
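
To inspect the most recent run, something like the following sketch can pick up the newest combined results file. It assumes only the filename pattern listed above; the helper name `load_latest_combined` is hypothetical, and because the timestamp format is not documented here, the sketch orders files by modification time instead of parsing the timestamp.

```python
import json
from pathlib import Path


def load_latest_combined(results_dir):
    """Return (path, parsed JSON) for the newest combined results file, or None.

    Hypothetical helper: relies only on the documented filename pattern.
    """
    # The per-metric detail files share the same prefix, so filter them out.
    detail_suffixes = ("_reusability.json", "_fitness.json")
    candidates = [
        p
        for p in Path(results_dir).glob("comprehensive_evaluation_*.json")
        if not p.name.endswith(detail_suffixes)
    ]
    if not candidates:
        return None
    # Timestamp format is unknown, so fall back to file modification time.
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    with newest.open(encoding="utf-8") as f:
        return newest, json.load(f)
```

Calling `load_latest_combined("../../evaluation/distribution_metric/results")` returns `None` when no combined file exists yet, which makes it safe to run before the first evaluation.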

## Metrics

- **Reusability Score**: Number of unique codes reused from the training corpus
- **Descriptive Fitness Score**: LLM-rated quality (1-10) of how well the assigned codes describe the content
- **Success Rate**: Percentage of generated codes that were successfully replaced with existing ones
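
As a rough illustration of the two count-based metrics, here is a minimal sketch. It assumes codes are plain strings compared by exact match and that replacements are available as a dict from generated code to existing code (or `None` when no replacement was found); the function names and data shapes are hypothetical, and the actual logic in `reusability_eval.py` may differ.

```python
def reusability_score(assigned_codes, training_codes):
    """Count distinct assigned codes that also occur in the training corpus.

    Assumption: codes are strings and reuse means an exact string match.
    """
    return len(set(assigned_codes) & set(training_codes))


def success_rate(generated_codes, replacements):
    """Percentage of generated codes that were mapped to an existing code.

    Assumption: `replacements` maps each generated code to an existing code,
    or to None (or is missing the key) when replacement failed.
    """
    if not generated_codes:
        return 0.0
    replaced = sum(1 for code in generated_codes if replacements.get(code) is not None)
    return 100.0 * replaced / len(generated_codes)
```

For example, if four codes are generated and two of them are mapped to existing codes, `success_rate` returns `50.0`.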

## Dependencies

- Requires a `.env` file that defines the LLM server URLs and model names
- Uses the training corpus and hierarchical tree produced by the main pipeline
- Test data must be a CSV file with a `text` column
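
Before launching a run, it can help to confirm that the test file meets the CSV requirement above. The sketch below uses only the standard library; the helper name `read_text_column` is hypothetical, and its `max_datapoints` argument merely mirrors the CLI flag of the same name without claiming to match how the scripts apply it.

```python
import csv


def read_text_column(path, column="text", max_datapoints=None):
    """Return the values of `column` from a CSV file, optionally truncated.

    Hypothetical pre-flight check: raises ValueError if the column is missing.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"{path} has no '{column}' column")
        values = [row[column] for row in reader]
    # Slicing with None keeps every row.
    return values[:max_datapoints]
```

Running this on the test CSV first surfaces a missing or misnamed column immediately, rather than partway through an evaluation that depends on LLM calls.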
