# Refine Iteration Package

This package contains components for refinement iterations in the schema induction pipeline.

## Files

- `datapoint_code_mapper.py` - Main class for mapping datapoints to codes and calculating frequencies
- `example_usage.py` - Example script showing how to use the mapper
- `run_mapper.py` - Command-line interface for the mapper
- `__init__.py` - Package initialization

## DatapointCodeMapper

The `DatapointCodeMapper` class covers four areas:

1. **Mapping Management**: Create and manage datapoint-to-code mappings
2. **Frequency Analysis**: Calculate and analyze code frequencies
3. **Data Persistence**: Save/load mappings in multiple formats (Parquet, JSON)
4. **Statistics**: Get detailed statistics about mappings and frequencies

### Key Features

- **Bidirectional Mapping**: Map datapoints to codes and codes to datapoints
- **Frequency Calculation**: Count how many datapoints each code appears in
- **Multiple Storage Formats**: Supports Parquet (preferred) and JSON formats
- **Batch Processing**: Efficiently process large numbers of mappings
- **Statistics**: Summary statistics for mappings and code frequencies
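
The bidirectional mapping and frequency calculation described above can be illustrated with a minimal sketch. `TinyMapper` and its method names are hypothetical stand-ins written for this README, not the actual `DatapointCodeMapper` implementation; the sketch assumes a code's frequency is the number of distinct datapoints it appears in.

```python
from collections import defaultdict


class TinyMapper:
    """Hypothetical stand-in for illustration; not the package's DatapointCodeMapper."""

    def __init__(self):
        # Forward index: datapoint -> set of codes
        self.datapoint_to_codes = defaultdict(set)
        # Reverse index: code -> set of datapoints
        self.code_to_datapoints = defaultdict(set)

    def add_mapping(self, datapoint, code):
        # Update both directions together so they never drift apart
        self.datapoint_to_codes[datapoint].add(code)
        self.code_to_datapoints[code].add(datapoint)

    def code_frequency(self, code):
        # Frequency = number of distinct datapoints the code appears in
        return len(self.code_to_datapoints.get(code, ()))


mapper = TinyMapper()
pairs = [
    ("doc A", "pricing"),
    ("doc A", "support"),
    ("doc B", "pricing"),
    ("doc B", "pricing"),  # duplicate pair, counted once
]
for datapoint, code in pairs:
    mapper.add_mapping(datapoint, code)

print(sorted(mapper.datapoint_to_codes["doc A"]))    # ['pricing', 'support']
print(sorted(mapper.code_to_datapoints["pricing"]))  # ['doc A', 'doc B']
print(mapper.code_frequency("pricing"))              # 2
```

Storing sets on both sides means a repeated (datapoint, code) pair does not inflate the frequency.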

### Usage Examples

#### Basic Usage

```python
from datapoint_code_mapper import DatapointCodeMapper

# Initialize mapper
mapper = DatapointCodeMapper()

# Load existing mappings
mapper.load_existing_mappings()

# Load from corpus
mapper.load_from_corpus('temp_files/corpus.parquet')

# Get codes for a datapoint
codes = mapper.get_codes_for_datapoint("your datapoint text")

# Get datapoints for a code
datapoints = mapper.get_datapoints_for_code("your code")

# Get code frequency
frequency = mapper.get_code_frequency("your code")

# Save mappings
mapper.save_mappings()
mapper.save_frequencies()
```

#### Command Line Usage

```bash
# Show statistics
python run_mapper.py --stats

# Load from corpus and save
python run_mapper.py --corpus temp_files/corpus.parquet --save

# Show top 10 most frequent codes
python run_mapper.py --top-codes 10

# Show rare codes (frequency <= 2)
python run_mapper.py --rare-codes 2

# Combine options
python run_mapper.py --corpus temp_files/corpus.parquet --stats --save
```

## File Structure

The mapper writes the following files to the `temp_files` directory:

- `datapoint_code_mapping.parquet` - Main mapping file (datapoint → code)
- `code_datapoint_mapping.parquet` - Reverse mapping file (code → datapoint)
- `code_frequencies.parquet` - Code frequency data
- `datapoint_code_cache.json` - JSON cache for compatibility
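
As a rough illustration of the JSON cache, the sketch below round-trips a datapoint-to-codes mapping through a JSON file. The on-disk layout (a JSON object whose values are lists of codes) and the helper names `save_cache` and `load_cache` are assumptions made for this example; the real schema of `datapoint_code_cache.json` may differ.

```python
import json
import tempfile
from pathlib import Path


def save_cache(mapping, path):
    # Hypothetical helper: sets are not JSON-serializable, so store sorted lists
    serializable = {dp: sorted(codes) for dp, codes in mapping.items()}
    Path(path).write_text(json.dumps(serializable, indent=2), encoding="utf-8")


def load_cache(path):
    # Hypothetical helper: restore the lists back into sets
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {dp: set(codes) for dp, codes in raw.items()}


mapping = {
    "doc A": {"pricing", "support"},
    "doc B": {"pricing"},
}

# Write to a throwaway directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "datapoint_code_cache.json"
    save_cache(mapping, cache_path)
    restored = load_cache(cache_path)

print(restored == mapping)  # True
```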

## Statistics

The mapper reports the following statistics:

- Total datapoints and codes
- Average mappings per datapoint/code
- Code frequency distribution
- Most frequent codes (top N)
- Rare codes (frequency at or below a threshold)
- Frequency range and median
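
The statistics listed above can be derived from a plain datapoint-to-codes dictionary. The following is a minimal sketch using only the standard library; the function name `summarize`, its parameters, and the result keys are hypothetical and are not taken from the package.

```python
from collections import Counter
from statistics import median


def summarize(datapoint_to_codes, top_n=10, rare_threshold=2):
    """Hypothetical helper: compute summary statistics for a mapping."""
    # Frequency of a code = number of datapoints it appears in
    frequencies = Counter(
        code for codes in datapoint_to_codes.values() for code in set(codes)
    )
    values = list(frequencies.values())
    total_datapoints = len(datapoint_to_codes)
    total_codes = len(frequencies)
    total_mappings = sum(values)
    return {
        "total_datapoints": total_datapoints,
        "total_codes": total_codes,
        "total_mappings": total_mappings,
        "avg_codes_per_datapoint": total_mappings / total_datapoints if total_datapoints else 0.0,
        "avg_datapoints_per_code": total_mappings / total_codes if total_codes else 0.0,
        "min_frequency": min(values, default=0),
        "max_frequency": max(values, default=0),
        "median_frequency": median(values) if values else 0.0,
        "top_codes": frequencies.most_common(top_n),
        "rare_codes": sorted(c for c, n in frequencies.items() if n <= rare_threshold),
    }


stats = summarize(
    {"d1": {"a", "b"}, "d2": {"a", "c"}, "d3": {"a"}},
    top_n=1,
    rare_threshold=1,
)
print(stats["total_mappings"])    # 5
print(stats["median_frequency"])  # 1
print(stats["top_codes"])         # [('a', 3)]
print(stats["rare_codes"])        # ['b', 'c']
```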

## Integration

This package is designed to work with the schema induction pipeline:

1. **Input**: Loads from corpus files created by the initial iteration
2. **Processing**: Creates mappings and calculates frequencies
3. **Output**: Saves results for use in refinement iterations
4. **Storage**: Uses the `temp_files` directory, which lives outside the `utils` folder

## Example Output

```
============================================================
DATAPOINT-CODE MAPPING STATISTICS
============================================================
Total datapoints: 300
Total codes: 8,714
Total mappings: 9,000
Average codes per datapoint: 30.00
Average datapoints per code: 1.03
Code frequency range: 1 - 17
Median code frequency: 1.0
Codes with frequency 1: 8,558
Codes with frequency 2+: 156
============================================================
```

Here, 8,558 of the 8,714 codes (about 98%) appear in only one datapoint. The remaining 156 codes appear in two or more datapoints, and the most frequent code appears in 17.
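
The derived figures in the example output follow directly from the totals. The short check below recomputes them; the variable names are illustrative only, and the numbers are copied from the output above.

```python
# Totals copied from the example output
total_datapoints = 300
total_codes = 8714
total_mappings = 9000
codes_with_frequency_1 = 8558
codes_with_frequency_2_plus = 156

# Averages are total mappings divided by the number of datapoints or codes
avg_codes_per_datapoint = total_mappings / total_datapoints
avg_datapoints_per_code = total_mappings / total_codes

# Share of codes that appear in exactly one datapoint
share_singletons = codes_with_frequency_1 / total_codes

print(f"{avg_codes_per_datapoint:.2f}")  # 30.00
print(f"{avg_datapoints_per_code:.2f}")  # 1.03
print(codes_with_frequency_1 + codes_with_frequency_2_plus == total_codes)  # True
print(f"{share_singletons:.1%}")  # 98.2%
```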
