# Two-Stage Data Synthesis Process

This directory contains a two-stage data synthesis pipeline that separates profile, policy, and tool generation from Q&A generation.

## Overview

The synthesis process has been split into two stages:

1. **Stage 1**: Generate profiles, policies, and tools (`Synthesize.py`)
2. **Stage 2**: Generate Q&A pairs using data from Stage 1 (`Synthesize_QA.py`)

This separation allows for:
- More flexible workflow management
- Ability to generate multiple Q&A sets from the same base data
- Better debugging and iteration
- Independent scaling of each stage

## Files

### Core Files
- `Synthesize.py` - First stage generation (profiles, policy, tools)
- `Synthesize_QA.py` - Second stage generation (Q&A pairs)
- `details.json` - Metadata file connecting the two stages
- `example_details.json` - Example of the details.json structure

### Generator Dependencies
- `profile_generator.py` - Profile generation logic
- `policy_generator.py` - Policy generation logic
- `tool_generator.py` - Tool generation logic
- `qa_generator.py` - Q&A generation logic

## Usage

### Stage 1: Generate Base Data

Run the first stage to generate profiles, policies, and tools:

```bash
python Synthesize.py
```

This will:
1. Generate hierarchical profiles
2. Generate policy documents
3. Generate tool definitions
4. Save metadata to `details.json`

**Output Structure:**
```
Generated_data/
├── details.json          # Metadata for Stage 2
├── Profiles/             # Profile JSON files
│   ├── layer_1_profiles.json
│   ├── layer_2_profiles.json
│   └── layer_3_profiles.json
├── Policy/               # Policy documents
│   └── Policy.md
└── Tools/                # Tool definitions
    └── all_tools.py
```
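
Before moving on to Stage 2, it can help to confirm that the layout above is actually on disk. The following is a minimal sketch, not part of the pipeline: the helper name `missing_stage1_outputs` and the `EXPECTED_STAGE1_OUTPUTS` list are hypothetical, and the paths are simply copied from the tree above.

```python
from pathlib import Path

# Paths copied from the output tree above, relative to the base directory.
# Hypothetical list for illustration; adjust if your layout differs.
EXPECTED_STAGE1_OUTPUTS = [
    "details.json",
    "Profiles",
    "Policy/Policy.md",
    "Tools/all_tools.py",
]


def missing_stage1_outputs(base_dir="Generated_data"):
    """Return the expected Stage 1 paths that are absent under base_dir."""
    base = Path(base_dir)
    return [rel for rel in EXPECTED_STAGE1_OUTPUTS if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_stage1_outputs()
    if missing:
        print("Stage 1 output is incomplete; missing:", ", ".join(missing))
    else:
        print("Stage 1 output looks complete.")
```

An empty list means every expected file and directory was found.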

### Stage 2: Generate Q&A

After Stage 1 completes, run the second stage to generate Q&A pairs:

```bash
# Basic usage - uses default base directory
python Synthesize_QA.py

# Specify custom directory
python Synthesize_QA.py --base_dir "My_Generated_data"

# Specify exact details.json path
python Synthesize_QA.py --details_path "Generated_data/details.json"

# Generate more Q&A pairs
python Synthesize_QA.py --num_requests 100

# Disable rollouts (faster generation)
python Synthesize_QA.py --no_rollouts
```

**Final Output Structure:**
```
Generated_data/
├── details.json          # Updated with QA completion info
├── Profiles/             # Profile JSON files
├── Policy/               # Policy documents
├── Tools/                # Tool definitions
└── Queries/              # NEW: Q&A data
    └── qa.json
```

## Command Line Options for Stage 2

| Option | Description | Default |
|--------|-------------|---------|
| `--details_path` | Path to the `details.json` file | Auto-detected |
| `--base_dir` | Base directory containing `details.json` | `Generated_data` |
| `--num_requests` | Number of Q&A pairs to generate | `50` |
| `--no_rollouts` | Disable rollouts during generation | `False` (rollouts enabled) |
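
For readers wiring these options into their own scripts, the table can be mirrored with `argparse` roughly as follows. This is a sketch under assumptions: the actual parser in `Synthesize_QA.py` may differ, `build_parser` and `resolve_details_path` are hypothetical names, and the auto-detection rule shown (fall back to `<base_dir>/details.json`) is a guess at what "Auto-detected" means.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser exposing the four Stage 2 options from the table above."""
    parser = argparse.ArgumentParser(description="Stage 2: generate Q&A pairs")
    parser.add_argument("--details_path", default=None,
                        help="Path to details.json (auto-detected when omitted)")
    parser.add_argument("--base_dir", default="Generated_data",
                        help="Base directory containing details.json")
    parser.add_argument("--num_requests", type=int, default=50,
                        help="Number of Q&A pairs to generate")
    # store_true makes the flag default to False, so rollouts stay enabled
    # unless --no_rollouts is passed.
    parser.add_argument("--no_rollouts", action="store_true",
                        help="Disable rollouts in generation")
    return parser


def resolve_details_path(args):
    """Assumed rule: an explicit --details_path wins, else <base_dir>/details.json."""
    if args.details_path is not None:
        return Path(args.details_path)
    return Path(args.base_dir) / "details.json"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("details.json:", resolve_details_path(args))
    print("num_requests:", args.num_requests)
    print("rollouts enabled:", not args.no_rollouts)
```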

## details.json Structure

The `details.json` file contains all metadata needed for Stage 2. In the example below, `[...]` and `...` mark elided values:

```json
{
  "configuration": {
    "profile_complexity_depth": 3,
    "profile_complexity_width": 500,
    "number_of_tasks": 5,
    "attribute_dict": [...]
  },
  "layer_lookup_tables": {
    "1": ["AB", "CD", ...],
    "2": ["QR", "ST", ...],
    "3": ["DD", "EE", ...]
  },
  "task_layer_requirements": {
    "1": {"task_1": [...], ...},
    "2": {"task_1": [...], ...},
    "3": {"task_1": [...], ...}
  },
  "global_attributes": {
    "global_attribute_1": "value1",
    "global_attribute_2": "value2"
  },
  "profiles_path": "Generated_data/Profiles/",
  "generation_stage": "first_stage_complete"
}
```
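
To inspect a generated `details.json` outside the pipeline, the fields shown above can be read with the standard `json` module. The sketch below assumes the file follows the structure in the example; `summarize_details` is a hypothetical helper, not a function provided by `Synthesize_QA.py`.

```python
import json


def summarize_details(details_path):
    """Load details.json and return a small summary of its key fields."""
    with open(details_path, encoding="utf-8") as f:
        details = json.load(f)

    config = details["configuration"]
    return {
        "depth": config["profile_complexity_depth"],
        "width": config["profile_complexity_width"],
        "tasks": config["number_of_tasks"],
        # Number of lookup entries recorded per layer, keyed by layer id.
        "lookup_sizes": {
            layer: len(codes)
            for layer, codes in details["layer_lookup_tables"].items()
        },
        "stage": details["generation_stage"],
    }
```

Note that JSON object keys are always strings, so the layer ids come back as `"1"`, `"2"`, and so on rather than integers.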

## Configuration

### Modifying Stage 1 Configuration

Edit the `main()` function in `Synthesize.py`:

```python
# Custom attribute configuration
custom_attribute_dict = [
    {  # Layer 1
        "1": "condition",
        "2": "lookup",
        "3": "reference_2"
    },
    {  # Layer 2
        "1": "condition",
        "2": "reference_1",
        "3": "lookup"
    }
]

synthesizer = DataSynthesizer(
    profile_complexity_depth=2,  # Match attribute_dict layers
    profile_complexity_width=1000,
    number_of_tasks=3,
    attribute_dict=custom_attribute_dict,
    base_output_dir="My_Custom_Data"
)
```
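
The comment in the example above says that `profile_complexity_depth` should match the number of layers in `attribute_dict`. A small guard like the one below can catch a mismatch before a long generation run. It is a sketch only: `check_attribute_dict` is a hypothetical helper, and whether `DataSynthesizer` performs a similar check internally is not stated here.

```python
def check_attribute_dict(attribute_dict, profile_complexity_depth):
    """Raise ValueError if the number of layers and the depth disagree."""
    layers = len(attribute_dict)
    if layers != profile_complexity_depth:
        raise ValueError(
            f"attribute_dict has {layers} layer(s) but "
            f"profile_complexity_depth is {profile_complexity_depth}"
        )


# Two layers with depth 2 passes silently.
check_attribute_dict(
    [
        {"1": "condition", "2": "lookup", "3": "reference_2"},
        {"1": "condition", "2": "reference_1", "3": "lookup"},
    ],
    profile_complexity_depth=2,
)
```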

## Programmatic Usage

### Using the QASynthesizer Class

```python
from Synthesize_QA import QASynthesizer, load_qa_synthesizer_from_directory

# Option 1: load from an explicit details.json path
qa_synth = QASynthesizer("Generated_data/details.json")

# Option 2: load from the base directory instead
qa_synth = load_qa_synthesizer_from_directory("Generated_data")

# Generate Q&A
qa_data = qa_synth.generate_qa(num_requests=100, include_rollouts=True)

# Print a summary of the loaded data
qa_synth.print_details_summary()
```

### Error Handling

Stage 2 checks its inputs before it runs:

- Verifies that Stage 1 completed before allowing Stage 2 to run
- Validates that required files exist
- Checks JSON format and structure
- Provides detailed error messages
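
The checks listed above could look roughly like this. It is an illustrative sketch rather than the code in `Synthesize_QA.py`: the function name, the exception types, and the set of required keys (taken from the `details.json` example in this README) are all assumptions.

```python
import json
from pathlib import Path

# Top-level keys shown in the details.json example (assumed to be required).
REQUIRED_KEYS = (
    "configuration",
    "layer_lookup_tables",
    "task_layer_requirements",
    "global_attributes",
    "profiles_path",
    "generation_stage",
)


def load_validated_details(details_path, allowed_stages=("first_stage_complete",)):
    """Load details.json, raising a descriptive error if it is unusable."""
    path = Path(details_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; run Stage 1 first")

    try:
        details = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc

    if not isinstance(details, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")

    missing = [key for key in REQUIRED_KEYS if key not in details]
    if missing:
        raise ValueError(f"{path} is missing keys: {', '.join(missing)}")

    # Stage 2 may update generation_stage after it runs; pass extra values
    # via allowed_stages if reruns should be accepted (assumption).
    if details["generation_stage"] not in allowed_stages:
        raise RuntimeError(
            f"unexpected generation_stage {details['generation_stage']!r}"
        )
    return details
```

Using distinct exception types lets a caller tell a missing file from a malformed one and react accordingly.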

## Benefits of Two-Stage Process

1. **Flexibility**: Generate multiple Q&A sets from the same base data
2. **Debugging**: Debug each stage independently
3. **Iteration**: Modify Q&A generation parameters without regenerating base data
4. **Scalability**: Run the stages on different machines or at different times
5. **Resumability**: Pause between stages and resume later

## Migration from Single-Stage

If you have existing code that uses the single-stage process, split the single call into the two stages:

**Old way:**
```python
synthesizer.synthesize_all(num_qa_requests=100)
```

**New way:**
```python
# Stage 1
synthesizer.synthesize_first_stage()

# Stage 2 (separate script or session)
from Synthesize_QA import load_qa_synthesizer_from_directory
qa_synth = load_qa_synthesizer_from_directory("Generated_data")
qa_data = qa_synth.generate_qa(num_requests=100)
```

The original `synthesize_all()` method is still available for backward compatibility.