# Schema Induction Evaluation

This directory contains evaluation tools and metrics for testing the schema induction pipeline on new data.

## Quick Start

### 1. Run Inference on New Data
Test the pipeline on new questions and data. The relative paths in this README assume you run commands from the `evaluation/` directory:

```bash
python test_inference/build_corpus.py \
  --test_data distribution_metric/input/your_test_data.csv \
  --question "Your question here" \
  --max_datapoints 10
```

### 2. Comprehensive Evaluation
Run full evaluation with reusability and fitness metrics:

```bash
python test_inference/comprehensive_evaluation_pipeline.py \
  --test_data distribution_metric/input/your_test_data.csv \
  --question "Your question here" \
  --train_corpus ../main_pipeline/temp_files/iteration_02/topologically_sorted_graph/final_corpus.parquet \
  --hierarchical_tree ../main_pipeline/temp_files/iteration_02/hierarchical_tree/hierarchical_tree_for_inference.json \
  --output_dir distribution_metric/results \
  --max_datapoints 10
```

## Directory Structure

```
evaluation/
├── test_inference/           # Core inference and evaluation scripts
│   ├── build_corpus.py      # Main inference pipeline
│   ├── comprehensive_evaluation_pipeline.py  # Full evaluation
│   ├── reusability_eval.py  # Reusability metrics
│   ├── descriptive_fitness_eval.py  # Fitness evaluation
│   └── ...
├── distribution_metric/      # Distribution analysis and results
│   ├── input/               # Test data files
│   ├── results/             # Evaluation results
│   └── ...
└── README.md               # This file
```

## Evaluation Components

### 🚀 Main Scripts
- **`build_corpus.py`** - Core inference pipeline for new data
- **`comprehensive_evaluation_pipeline.py`** - Full evaluation with all metrics
- **`consistency_stability_eval.py`** - Consistency and stability analysis
- **`descriptive_coverage_eval.py`** - Coverage evaluation
- **`parsimony_eval.py`** - Parsimony (simplicity) evaluation
- **`zipf_distribution_eval.py`** - Zipf distribution analysis

### 📊 Evaluation Metrics
- **Reusability Score**: Number of unique codes reused from training corpus
- **Descriptive Fitness Score**: LLM-rated quality (1-10) of how well codes describe content
- **Consistency Score**: Stability across multiple runs
- **Coverage Score**: How well codes cover the content
- **Parsimony Score**: Simplicity and efficiency of the code set
- **Zipf Distribution**: How closely the frequency distribution of codes follows a Zipf (rank-frequency) pattern
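
As a rough illustration of the first metric, the sketch below counts the unique codes that a test run shares with the training corpus. It is one plausible reading of the definition above, not the logic of `reusability_eval.py`; the function name and the plain lists of code strings are assumptions made for this example.

```python
from typing import Iterable


def reusability_score(test_codes: Iterable[str], train_codes: Iterable[str]) -> int:
    """Count unique test codes that also occur in the training corpus.

    Hypothetical helper: the real script may normalize or match codes differently.
    """
    return len(set(test_codes) & set(train_codes))


# Illustrative data only, not real pipeline output.
train = ["humor", "storytelling", "analogy", "call_to_action"]
test = ["humor", "humor", "analogy", "personal_anecdote"]

print(reusability_score(test, train))  # 2 ("humor" and "analogy")
```

Duplicates in the test run are counted once, which matches the "unique codes" wording of the metric.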

## Usage Examples

### Basic Inference
```bash
# Test on new data
python test_inference/build_corpus.py \
  --test_data distribution_metric/input/aliabdaal_test.csv \
  --question "How Ali mixes teaching with entertainment?" \
  --max_datapoints 10
```

### Full Evaluation
```bash
# Complete evaluation with all metrics
python test_inference/comprehensive_evaluation_pipeline.py \
  --test_data distribution_metric/input/aliabdaal_test.csv \
  --question "How Ali mixes teaching with entertainment?" \
  --train_corpus ../main_pipeline/temp_files/iteration_02/topologically_sorted_graph/final_corpus.parquet \
  --hierarchical_tree ../main_pipeline/temp_files/iteration_02/hierarchical_tree/hierarchical_tree_for_inference.json \
  --output_dir distribution_metric/results \
  --max_datapoints 10
```
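
To evaluate several questions in one go, the command above can be assembled in Python and passed to `subprocess`. This is a minimal sketch: the flags are the ones shown in this README, while `build_eval_command` and `run_all` are hypothetical helper names, not part of the repository.

```python
import shlex
import subprocess
import sys
from typing import Iterable, List

SCRIPT = "test_inference/comprehensive_evaluation_pipeline.py"


def build_eval_command(
    test_data: str,
    question: str,
    train_corpus: str,
    hierarchical_tree: str,
    output_dir: str,
    max_datapoints: int = 10,
) -> List[str]:
    """Build the argument list for one comprehensive evaluation run."""
    return [
        sys.executable, SCRIPT,
        "--test_data", test_data,
        "--question", question,
        "--train_corpus", train_corpus,
        "--hierarchical_tree", hierarchical_tree,
        "--output_dir", output_dir,
        "--max_datapoints", str(max_datapoints),
    ]


def run_all(questions: Iterable[str], **paths: str) -> None:
    """Run the pipeline once per question, stopping at the first failure."""
    for question in questions:
        subprocess.run(build_eval_command(question=question, **paths), check=True)


# Print the command instead of running it, so the sketch has no side effects.
command = build_eval_command(
    test_data="distribution_metric/input/your_test_data.csv",
    question="Your question here",
    train_corpus="final_corpus.parquet",  # placeholder path
    hierarchical_tree="hierarchical_tree_for_inference.json",  # placeholder path
    output_dir="distribution_metric/results",
)
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when a question contains spaces or punctuation.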

### Individual Metrics
```bash
# Reusability only
python test_inference/reusability_eval.py \
  --test_data distribution_metric/input/aliabdaal_test.csv \
  --question "How Ali mixes teaching with entertainment?" \
  --train_corpus ../main_pipeline/temp_files/iteration_02/topologically_sorted_graph/final_corpus.parquet \
  --hierarchical_tree ../main_pipeline/temp_files/iteration_02/hierarchical_tree/hierarchical_tree_for_inference.json \
  --output reusability_results.json

# Descriptive fitness only
python test_inference/descriptive_fitness_eval.py \
  --test_data distribution_metric/input/aliabdaal_test.csv \
  --question "How Ali mixes teaching with entertainment?" \
  --max_datapoints 10
```

## Input Data Format

Test data should be in CSV format with:
- **`text`** column: The content to be analyzed
- **`question`** column (optional): Specific questions for each text
- Additional metadata columns as needed

Example:
```csv
text,question
"Sample text content here","How does this relate to the topic?"
"More content to analyze","What patterns emerge?"
```
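
Before starting a long run, it can help to confirm that a file matches this format. The sketch below uses only the standard library; `check_test_csv` is a hypothetical helper and is not part of the evaluation scripts.

```python
import csv
import io
from typing import List, TextIO


def check_test_csv(handle: TextIO) -> List[str]:
    """Return a list of problems found in a test CSV (empty list means none found)."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    if "text" not in columns:
        return ["missing required 'text' column"]

    problems = []
    # Records are numbered from 2 because the header is record 1.
    for row_number, row in enumerate(reader, start=2):
        if not (row.get("text") or "").strip():
            problems.append(f"row {row_number}: empty 'text' value")
    return problems


# In-memory sample with one deliberately empty 'text' value.
sample = io.StringIO(
    'text,question\n'
    '"Sample text content here","How does this relate to the topic?"\n'
    '"","What patterns emerge?"\n'
)
print(check_test_csv(sample))  # ["row 3: empty 'text' value"]
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to the function.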

## Output Structure

Results are saved with timestamps:
```
distribution_metric/results/
├── comprehensive_evaluation_[timestamp].json
├── comprehensive_evaluation_[timestamp]_reusability.json
├── comprehensive_evaluation_[timestamp]_fitness.json
├── comprehensive_evaluation_[timestamp]_consistency.json
└── ...
```
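
Because the timestamp format is not specified here, a script that collects results is safer matching on the file name suffix and sorting by modification time. The following is a sketch under that assumption; `classify_result` and `group_results` are hypothetical names, and only the three suffixes listed above are recognized.

```python
from pathlib import Path
from typing import Dict, List

KNOWN_SUFFIXES = ("reusability", "fitness", "consistency")


def classify_result(file_name: str) -> str:
    """Return the metric suffix of a result file, or "other" if none matches.

    "other" covers the top-level summary file and any suffix not listed above.
    """
    stem = Path(file_name).stem
    for suffix in KNOWN_SUFFIXES:
        if stem.endswith("_" + suffix):
            return suffix
    return "other"


def group_results(results_dir: str) -> Dict[str, List[Path]]:
    """Group result files by metric, newest first (by modification time)."""
    groups: Dict[str, List[Path]] = {name: [] for name in KNOWN_SUFFIXES + ("other",)}
    files = sorted(
        Path(results_dir).glob("comprehensive_evaluation_*.json"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
    for path in files:
        groups[classify_result(path.name)].append(path)
    return groups


# "TS" stands in for the real timestamp, whose format is not assumed here.
print(classify_result("comprehensive_evaluation_TS_fitness.json"))  # fitness
print(classify_result("comprehensive_evaluation_TS.json"))  # other
```

Calling `group_results("distribution_metric/results")` then gives the newest file for each metric as the first entry of each list.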

## Requirements

- Python 3.8+
- Required packages: `pandas`, `numpy`, `aiohttp` (`asyncio` ships with the Python standard library and needs no separate install)
- Access to LLM servers (configured via environment variables)
- Training corpus and hierarchical tree from main pipeline

## Environment Setup

Set these environment variables:
```bash
export VLLM_QWEN_32B_URL="your_llm_server_url"
export VLLM_QWEN_32B_MODEL="Qwen/Qwen3-32B"
```
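
A quick check that both variables are set can save a failed run later. This is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper, and it only tests that the variables are non-empty, not that the server is reachable.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("VLLM_QWEN_32B_URL", "VLLM_QWEN_32B_MODEL")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("LLM environment variables are set.")
```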

## Troubleshooting

- **No results**: Check that test data has the correct format and columns
- **Missing files**: Ensure training corpus and hierarchical tree paths are correct
- **Slow performance**: Reduce `--max_datapoints` or check server availability

## Advanced Usage

For more detailed analysis, you can run individual evaluation components:
- `consistency_stability_eval.py` - For consistency analysis
- `descriptive_coverage_eval.py` - For coverage analysis
- `parsimony_eval.py` - For simplicity analysis
- `zipf_distribution_eval.py` - For distribution analysis

The comprehensive evaluation pipeline already handles these automatically, so run them individually only when you need a single analysis.
