# Schema Induction Pipeline

## Quick Start

### 1. Run Unified Pipeline
Generate initial and refined codes for a question over a chosen number of iterations:

```bash
python unified_pipeline.py --mode single-qa --question "Your question here" --iterations 3
```

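**Parameters:**
- `--mode`: Pipeline mode (`single-qa`, `batch-qa`, `codebook-gen`, `qa-with-codebook`)
- `--question`: Your input question or topic (used by `single-qa` and `qa-with-codebook`)
- `--questions-file`: CSV file of questions (used by `batch-qa` and `codebook-gen`)
- `--codebook-path`: Path to an existing codebook (used by `qa-with-codebook`)
- `--iterations`: Number of iterations to run (default: 2)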

**Examples:**
```bash
# Single question
python unified_pipeline.py --mode single-qa --question "How to implement user authentication?" --iterations 2

# Batch processing
python unified_pipeline.py --mode batch-qa --questions-file questions.csv --iterations 3

# Generate codebook
python unified_pipeline.py --mode codebook-gen --questions-file questions.csv --iterations 2
```

### 2. Generate Final Corpus and Hierarchical Tree
For any intermediate iteration, generate the final corpus and hierarchical tree:

```bash
python generate_final_corpus.py <iteration_number>
python generate_hierarchical_tree.py <iteration_number>
```

**Examples:**
```bash
python generate_final_corpus.py 1  # Generate final corpus for iteration 1
python generate_hierarchical_tree.py 2  # Generate hierarchical tree for iteration 2
```

### 3. Generate Response (Use This for Now)
**Note**: The response step of the unified pipeline still uses a previous version; a refined version is planned. Until then, generate responses with `query_final_corpus.py`:

```bash
python query_final_corpus.py --corpus-path temp_files/iteration_02/topologically_sorted_graph/final_corpus.parquet --question "Your question"
```

**Optional parameters:**
- `--max-tags`: Maximum number of tags to include in context (default: 100)
- `--output`: Output file to save the response

**Example with options:**
```bash
python query_final_corpus.py --corpus-path temp_files/iteration_02/topologically_sorted_graph/final_corpus.parquet --question "Your question" --max-tags 50 --output response.txt
```
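
Steps 2 and 3 can be chained for a given iteration. The sketch below only builds and prints the three commands; it does not run them. It assumes that the corpus for iteration `N` is written to `temp_files/iteration_NN/topologically_sorted_graph/final_corpus.parquet`, which is generalized from the `iteration_02` path used in this README and may not hold for every setup. The helper names `corpus_path` and `build_commands` are hypothetical.

```python
import shlex
from pathlib import Path
from typing import List


def corpus_path(iteration: int, root: str = "temp_files") -> Path:
    """Assumed location of the final corpus for one iteration."""
    return (
        Path(root)
        / f"iteration_{iteration:02d}"
        / "topologically_sorted_graph"
        / "final_corpus.parquet"
    )


def build_commands(iteration: int, question: str, python: str = "python") -> List[List[str]]:
    """Return the argv lists for corpus generation, tree generation, and querying."""
    return [
        [python, "generate_final_corpus.py", str(iteration)],
        [python, "generate_hierarchical_tree.py", str(iteration)],
        [
            python,
            "query_final_corpus.py",
            "--corpus-path",
            corpus_path(iteration).as_posix(),
            "--question",
            question,
        ],
    ]


if __name__ == "__main__":
    # Print shell-quoted commands so they can be reviewed before running.
    for cmd in build_commands(2, "Your question"):
        print(shlex.join(cmd))
```

Each argv list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.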

## What You Get

### From Unified Pipeline:
- **Initial codes** from your question
- **Refined codes** through multiple iterations
- **High-level codes** for better organization
- **Relationship matrices** showing code connections
- **Conflict resolution** for consistent results
- **Final answers** to your questions

### From Final Corpus Generation:
- **Complete corpus** with all codes organized
- **Code mappings** and relationships
- **Analysis reports** for each iteration

### From Hierarchical Tree Generation:
- **Hierarchical tree** for inference testing
- **Structured knowledge** representation
- **Tree visualization** files

## Pipeline Modes

### 1. Single QA (`single-qa`)
Answer a single question with multi-iteration refinement:
```bash
python unified_pipeline.py --mode single-qa --question "Your question" --iterations 2
```

### 2. Batch QA (`batch-qa`)
Process multiple questions with clustering:
```bash
python unified_pipeline.py --mode batch-qa --questions-file questions.csv --iterations 3
```

### 3. Codebook Generation (`codebook-gen`)
Generate codebooks for later use:
```bash
python unified_pipeline.py --mode codebook-gen --questions-file questions.csv --iterations 2
```

### 4. QA with Existing Codebook (`qa-with-codebook`)
Answer a question using an existing codebook plus multi-iteration context:
```bash
python unified_pipeline.py --mode qa-with-codebook --question "Your question" --codebook-path path/to/codebook --iterations 2
```
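
The four modes take different inputs. As a rough sketch, the mapping below records which flags each mode is shown with in the examples above and builds the matching command line. Whether `unified_pipeline.py` strictly requires these flags is an assumption inferred from those examples, and `MODE_FLAGS` and `build_pipeline_args` are hypothetical names.

```python
from typing import Dict, List

# Flags each mode appears with in this README's examples (assumed to be required).
MODE_FLAGS: Dict[str, List[str]] = {
    "single-qa": ["--question"],
    "batch-qa": ["--questions-file"],
    "codebook-gen": ["--questions-file"],
    "qa-with-codebook": ["--question", "--codebook-path"],
}


def build_pipeline_args(mode: str, iterations: int = 2, **options: str) -> List[str]:
    """Build an argv list for unified_pipeline.py, checking mode-specific flags."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode}")
    args = ["python", "unified_pipeline.py", "--mode", mode]
    for flag in MODE_FLAGS[mode]:
        # "--questions-file" is looked up as the keyword "questions_file".
        key = flag.lstrip("-").replace("-", "_")
        value = options.get(key)
        if value is None:
            raise ValueError(f"mode {mode!r} needs {flag}")
        args += [flag, str(value)]
    args += ["--iterations", str(iterations)]
    return args


if __name__ == "__main__":
    print(build_pipeline_args("batch-qa", 3, questions_file="questions.csv"))
```

Failing early on a missing flag avoids starting a long multi-iteration run with incomplete input.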

## Output Structure

```
temp_files/
├── iteration_01/
│   ├── embeddings/
│   ├── high_level_codes/
│   ├── cluster_sim/
│   ├── nli_classify/
│   ├── conflict_detection/
│   └── topologically_sorted_graph/
├── iteration_02/
│   └── ...
└── iteration_03/
    └── ...
```
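
To check whether an iteration produced the full layout, a small script can compare the directory against the tree above. This is a sketch: it assumes every iteration contains the same six subdirectories listed for `iteration_01` (the tree elides the contents of the later iterations), and `missing_subdirs` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Subdirectories listed under iteration_01 in the tree above.
EXPECTED_SUBDIRS = [
    "embeddings",
    "high_level_codes",
    "cluster_sim",
    "nli_classify",
    "conflict_detection",
    "topologically_sorted_graph",
]


def iteration_dir(iteration: int, root: str = "temp_files") -> Path:
    """Directory for one iteration, e.g. temp_files/iteration_01."""
    return Path(root) / f"iteration_{iteration:02d}"


def missing_subdirs(iteration: int, root: str = "temp_files") -> List[str]:
    """Return the expected subdirectories that do not exist for this iteration."""
    base = iteration_dir(iteration, root)
    return [name for name in EXPECTED_SUBDIRS if not (base / name).is_dir()]


if __name__ == "__main__":
    for n in (1, 2, 3):
        missing = missing_subdirs(n)
        status = "complete" if not missing else "missing: " + ", ".join(missing)
        print(f"{iteration_dir(n)}: {status}")
```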

## Key Features

- **🔄 Multi-iteration refinement** - Codes are refined further with each iteration
- **🧠 LLM-powered selection** - Uses a large language model for code selection
- **⚖️ Load balancing** - Distributes work across multiple servers
- **🔍 Conflict detection** - Resolves inconsistencies automatically
- **🌳 Hierarchical organization** - Creates structured knowledge trees
- **📊 Performance monitoring** - Tracks progress and statistics
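
This README does not say how work is distributed across servers. As one simple illustration of the load-balancing idea, the sketch below assigns items to servers in round-robin order. Round-robin is an assumption here, not necessarily the strategy the pipeline uses, and `assign_round_robin` is a hypothetical helper.

```python
import itertools
from typing import List, Sequence, Tuple


def assign_round_robin(items: Sequence[str], servers: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each item with a server, cycling through the servers in order."""
    if not servers:
        raise ValueError("at least one server is required")
    server_cycle = itertools.cycle(servers)
    return [(item, next(server_cycle)) for item in items]


if __name__ == "__main__":
    # Placeholder server names; in practice these could be the classifier URLs.
    for item, server in assign_round_robin(["pair-1", "pair-2", "pair-3"], ["server-a", "server-b"]):
        print(item, "->", server)
```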

## Requirements

- Python 3.8+
- Required packages: `pandas`, `numpy`, `aiohttp` (`asyncio` is part of the Python standard library and needs no installation)
- Access to NLI classifier servers (configured via environment variables)

## Environment Setup

Set these environment variables:
```bash
export CLASSIFIER_URL="your_primary_classifier_url"
export CLASSIFIER_URL_2="your_secondary_classifier_url"
export VLLM_QWEN_32B_URL="your_llm_server_url"
export VLLM_QWEN_32B_MODEL="Qwen/Qwen3-32B"
```
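
Before a long run, it can help to confirm that these variables are actually set. The following is a minimal sketch that assumes all four variables are required; the pipeline may treat some of them, such as the secondary classifier URL, as optional. `missing_env_vars` is a hypothetical helper.

```python
import os
import sys
from typing import List, Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = (
    "CLASSIFIER_URL",
    "CLASSIFIER_URL_2",
    "VLLM_QWEN_32B_URL",
    "VLLM_QWEN_32B_MODEL",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing), file=sys.stderr)
    else:
        print("All expected environment variables are set.")
```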

## Troubleshooting

- **No results**: Check that your question is clear and specific
- **Slow performance**: Reduce iteration count or check server availability
- **Missing files**: Ensure all required directories exist and have proper permissions
