# PersonalContextWeaver: A Document Generation Benchmark Framework

This framework benchmarks how well LLMs generate personalized documents from long conversations. It evaluates four key aspects:

1. **User Profile Inference**: How accurately can the LLM infer user profiles from message content only?
2. **Intent Capture**: How accurately can the LLM capture user intent from queries into structured schemas?
3. **Citation Accuracy**: How well does the LLM cite source messages and base content on facts?
4. **Document Quality**: How good are the generated documents for human readers (using LLM-as-a-judge)?

## Architecture

### Core Components

- **`document_generation.py`**: Main benchmark framework with all evaluation logic
- **`benchmark_analysis.py`**: Analysis and visualization tools for results
- **`run_benchmark.py`**: Test runner and demonstration script
- **`benchmark_config.json`**: Configuration file for customizing benchmark parameters

### Data Structure

```
benchmark/
├── src/
│   ├── document_generation.py      # Main benchmark framework
│   ├── benchmark_analysis.py       # Analysis and visualization
│   ├── run_benchmark.py           # Test runner and interactive menu
│   ├── benchmark_config.json      # Configuration file
│   ├── synthetic_queries/          # Domain-specific user queries
│   │   ├── generated_user_queries_Finance.json
│   │   ├── generated_user_queries_Healthcare.json
│   │   ├── generated_user_queries_Manufacturing.json
│   │   └── generated_user_queries_Technology.json
│   └── results/                    # Benchmark results and evaluations
│       ├── eval_reference_based.py  # Reference-based evaluation script
│       ├── golden_documents/         # Extracted golden reference documents
│       └── [model_results]/          # Results by model (GPT-4.1, GPT-4o, etc.)
├── data/                           # Domain conversation data
│   ├── Finance/
│   │   ├── synthetic_domain_channels_Finance.json
│   │   └── synthetic_domain_channels_graph_Finance.gml
│   ├── Healthcare/
│   ├── Manufacturing/
│   └── Technology/
└── requirements.txt               # Python dependencies
```

## Setup

### Prerequisites

1. **Python 3.8+** 
2. **Azure OpenAI Service** with GPT-4 access
3. **Azure authentication** configured (Azure CLI or environment variables)

### Installation

1. Install dependencies:
```bash
pip install -r requirements.txt
```

2. Set up Azure OpenAI credentials:
```bash
# Option 1: Environment variable
export ENDPOINT_URL="https://your-endpoint.openai.azure.com/"

# Option 2: Azure CLI login
az login
```

3. Update `benchmark_config.json` with your settings if needed.
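
Before running anything, it can help to confirm that the endpoint variable is actually visible to Python. The following is a minimal sketch, not part of the framework: `read_endpoint` is a hypothetical helper, and it only checks that `ENDPOINT_URL` looks like an HTTPS URL, not that the service is reachable.

```python
import os
from urllib.parse import urlparse


def read_endpoint(env=None):
    """Return ENDPOINT_URL, or raise if it is missing or not an https URL.

    Hypothetical helper for illustration; the framework may read the
    variable differently.
    """
    env = os.environ if env is None else env
    url = env.get("ENDPOINT_URL", "").strip()
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("ENDPOINT_URL must be set to an https:// URL")
    return url


if __name__ == "__main__":
    try:
        print("Using endpoint:", read_endpoint())
    except ValueError as exc:
        print("Setup problem:", exc)
```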

## Quick Start

### 1. Navigate to Source Directory

```bash
cd src/
```

### 2. Run Interactive Benchmark Runner

```bash
python run_benchmark.py
```

This provides an interactive menu with options:

1. **Test individual components** - Verify setup and test framework components
2. **Run sample benchmark (Finance domain)** - Quick 3-query test
3. **Run multi-domain comparison** - Compare performance across all domains
4. **Run custom benchmark** - Choose domain and number of queries
5. **Analyze existing results** - Generate reports from previous runs
6. **Exit**

### 3. First Time Setup Validation

Choose option **1** to validate your setup:
- Checks for required data files
- Tests Azure OpenAI connection
- Verifies all components work correctly

### 4. Run Sample Benchmark

Choose option **2** for a quick test:
- Processes 3 Finance domain queries
- Shows all evaluation metrics
- Takes ~5-10 minutes depending on model

### 5. Reference-Based Evaluation

For ROUGE/BLEU/METEOR metrics against golden documents:

```bash
# From the src/ directory
python results/eval_reference_based.py --model_dir GPT-5 --output_json gpt5_results.json
```

## Detailed Usage

### 1. User Profile Inference Benchmark

Tests how well the LLM can infer user characteristics from messages:

```python
user_profile = benchmark.infer_user_profile(messages, user_id)
# Returns: UserProfile with role, expertise, style, tone, etc.
```

**Evaluation**: Compares inferred profile against ground truth persona data.
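
One simple way to compare an inferred profile with a ground truth persona is a field-by-field exact match. The sketch below assumes both are plain dictionaries with `role`, `tone`, and `style` keys (taken from the persona format in the query files); `profile_accuracy` is a hypothetical function, and the framework's own scorer may weight fields or use fuzzy matching instead.

```python
def _normalize(value):
    """Lowercase and trim a field value so trivial differences do not count."""
    return str(value).strip().lower()


def profile_accuracy(inferred, ground_truth, fields=("role", "tone", "style")):
    """Fraction of fields where the inferred value equals the ground truth.

    Assumption: exact match after normalization. A field absent from the
    ground truth never counts as a match.
    """
    if not fields:
        return 0.0
    matches = sum(
        1
        for field in fields
        if field in ground_truth
        and _normalize(inferred.get(field, "")) == _normalize(ground_truth[field])
    )
    return matches / len(fields)


inferred = {"role": "finance project manager", "tone": "formal", "style": "Elaborative"}
persona = {"role": "Finance Project Manager", "tone": "persuasive", "style": "elaborative"}
print(round(profile_accuracy(inferred, persona), 3))  # role and style match, tone does not
```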

### 2. Intent Capture Benchmark  

Tests how well the LLM captures document generation intent:

```python
intent = benchmark.capture_user_intent(query, context)
# Returns: IntentSchema with document_type, audience, scope, etc.
```

**Evaluation**: Measures accuracy of structured intent extraction.

### 3. Citation Accuracy Benchmark

Tests whether the generated document cites the right source messages and grounds its content in them:

```python
document = benchmark.generate_document_with_citations(messages, profile, intent)
# Returns: GeneratedDocument with content and citations
```

**Evaluation**: Validates citation relevance and accuracy against expected sources.
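
Citation accuracy can be illustrated as precision, recall, and F1 over sets of message IDs. This is a sketch under the assumption that citations and expected sources are both lists of IDs such as `Msg_123`; `citation_f1` is a hypothetical name and not necessarily how the framework computes its score.

```python
def citation_f1(cited_ids, expected_ids):
    """Precision, recall, and F1 between cited and expected message IDs."""
    cited, expected = set(cited_ids), set(expected_ids)
    if not cited or not expected:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    hits = len(cited & expected)
    precision = hits / len(cited)   # share of citations that were expected
    recall = hits / len(expected)   # share of expected sources that were cited
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = citation_f1(
    cited_ids=["Msg_1", "Msg_2", "Msg_3"],
    expected_ids=["Msg_2", "Msg_3", "Msg_4", "Msg_5"],
)
print(scores)
```

Here two of three citations are expected (precision 2/3) and two of four expected sources are cited (recall 1/2).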

### 4. Document Quality Benchmark

Uses LLM-as-a-judge to evaluate generated documents:

```python
quality_scores = benchmark.evaluate_document_quality(document)
# Returns: Scores for factual accuracy, structure, readability, etc.
```

**Evaluation**: Multi-dimensional quality assessment with detailed feedback.

## Results and Analysis

### Automatic Generation

The framework automatically generates:

- **Individual result files**: `query_X_result.json` for each processed query
- **Summary statistics**: `benchmark_summary.json` with aggregated metrics  
- **Detailed report**: `detailed_report.txt` with comprehensive analysis
- **Visualizations**: Charts and graphs in `visualizations/` folder

### Key Metrics

- **User Profile Accuracy**: 0-1 score comparing inferred vs. ground truth
- **Intent Capture Accuracy**: 0-1 score for structured intent extraction
- **Citation Accuracy**: F1-score for citation precision and recall
- **Document Quality Score**: 1-5 LLM judge score across 7 dimensions
- **Overall Score**: Weighted average of all metrics
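
As an illustration of the weighted average, the sketch below combines four component scores using per-component weights. This README does not state whether the 1-5 quality score is rescaled before averaging, so `rescale_quality` is shown as one possible choice rather than the framework's behavior; both function names and the sample scores are hypothetical.

```python
def rescale_quality(score, low=1.0, high=5.0):
    """Map a judge score on [low, high] onto [0, 1] (an assumed convention)."""
    return (score - low) / (high - low)


def overall_score(scores, weights):
    """Weighted average of component scores, normalized by the total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical component scores, not results from any run.
example_scores = {
    "user_profile_inference": 0.8,
    "intent_capture": 0.6,
    "citation_accuracy": 0.7,
    "document_quality": rescale_quality(4.0),
}
equal_weights = {name: 0.25 for name in example_scores}
print(overall_score(example_scores, equal_weights))
```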

### Visualization Types

1. **Score distributions** - Histograms and box plots
2. **Component performance** - Bar charts by benchmark area
3. **Document type analysis** - Performance by document type
4. **User role analysis** - Performance by user role
5. **Quality dimensions** - Breakdown by quality criteria
6. **Citation patterns** - Citation usage and accuracy
7. **Correlation analysis** - Relationships between metrics

## Configuration

### Benchmark Settings

Edit `benchmark_config.json` to customize:

```json
{
  "benchmark_config": {
    "model_name": "gpt-4",
    "max_queries_per_run": 50,
    "temperature": 0.1,
    "evaluation_runs": 3
  },
  "evaluation_criteria": {
    "user_profile_inference": {"weight": 0.25},
    "intent_capture": {"weight": 0.25}, 
    "citation_accuracy": {"weight": 0.25},
    "document_quality": {"weight": 0.25}
  }
}
```
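
When editing the weights, a quick sanity check can catch typos. The sketch below parses a config with the structure shown above and assumes the weights are meant to sum to 1, which holds for the defaults but is not stated as a requirement; `load_weights` is a hypothetical helper.

```python
import json
import math


def load_weights(config):
    """Extract per-criterion weights and check that they sum to 1.0."""
    criteria = config["evaluation_criteria"]
    weights = {name: float(spec["weight"]) for name, spec in criteria.items()}
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return weights


CONFIG_TEXT = """
{
  "evaluation_criteria": {
    "user_profile_inference": {"weight": 0.25},
    "intent_capture": {"weight": 0.25},
    "citation_accuracy": {"weight": 0.25},
    "document_quality": {"weight": 0.25}
  }
}
"""
weights = load_weights(json.loads(CONFIG_TEXT))
print(weights)
```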

### Data Paths

Specify paths to your conversation data and queries:

```json
{
  "data_paths": {
    "conversation_files": {
      "Finance": "../data/Finance/synthetic_domain_channels_Finance.json",
      "Healthcare": "../data/Healthcare/synthetic_domain_channels_Healthcare.json",
      "Technology": "../data/Technology/synthetic_domain_channels_Technology.json",
      "Manufacturing": "../data/Manufacturing/synthetic_domain_channels_Manufacturing.json"
    },
    "queries_file": {
      "Finance": "./synthetic_queries/generated_user_queries_Finance.json",
      "Healthcare": "./synthetic_queries/generated_user_queries_Healthcare.json",
      "Technology": "./synthetic_queries/generated_user_queries_Technology.json",
      "Manufacturing": "./synthetic_queries/generated_user_queries_Manufacturing.json"
    },
    "output_directory": "./results/benchmark_results"
  }
}
```
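
Because these paths are relative, they resolve against the directory the benchmark is started from. A small check like the following sketch can list entries that do not point to an existing file; `find_missing_paths` is a hypothetical helper, and it assumes the `data_paths` layout shown above.

```python
from pathlib import Path


def find_missing_paths(data_paths, base_dir="."):
    """Return (section, domain, path) tuples for configured files that do not exist."""
    base = Path(base_dir)
    missing = []
    for section in ("conversation_files", "queries_file"):
        for domain, rel_path in data_paths.get(section, {}).items():
            if not (base / rel_path).is_file():
                missing.append((section, domain, rel_path))
    return missing


if __name__ == "__main__":
    import json

    config_path = Path("benchmark_config.json")  # expected in src/
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        for entry in find_missing_paths(config.get("data_paths", {})):
            print("Missing:", entry)
```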

## Data Format Requirements

### Conversation Data

Expected format for conversation messages:

```json
{
  "Project_Name": [
    {
      "msg_node": "Msg_123",
      "content": "Message content...",
      "author": "User_1", 
      "timestamp": "2025-06-29T01:48:38",
      "role": "Product Manager",
      "tone": "formal",
      "style": "elaborative"
    }
  ]
}
```
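
To show how this structure can be consumed, the sketch below groups messages by author across projects and orders each author's messages by time. `messages_by_author` is a hypothetical helper, and it assumes every message carries the `author` and `timestamp` fields shown above, with timestamps in the same ISO 8601 layout.

```python
from collections import defaultdict


def messages_by_author(conversations):
    """Group messages from all projects by author, oldest first."""
    grouped = defaultdict(list)
    for _project, messages in conversations.items():
        for message in messages:
            grouped[message["author"]].append(message)
    for messages in grouped.values():
        # ISO 8601 strings in one fixed layout sort chronologically as text.
        messages.sort(key=lambda m: m["timestamp"])
    return dict(grouped)


sample = {
    "Project_A": [
        {"msg_node": "Msg_2", "author": "User_1", "timestamp": "2025-06-29T02:00:00", "content": "b"},
        {"msg_node": "Msg_1", "author": "User_1", "timestamp": "2025-06-29T01:48:38", "content": "a"},
        {"msg_node": "Msg_3", "author": "User_2", "timestamp": "2025-06-29T01:50:00", "content": "c"},
    ]
}
grouped = messages_by_author(sample)
print({author: [m["msg_node"] for m in msgs] for author, msgs in grouped.items()})
```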

### User Queries

Expected format for synthetic queries:

```json
[
  {
    "query": "Generate a status report...",
    "document_type": "status_report",
    "user_id": "User_7", 
    "persona": {
      "role": "Finance Project Manager",
      "tone": "persuasive",
      "style": "elaborative"
    },
    "contextual_markers": {
      "entities": [["entity_name", "Msg_123"]]
    }
  }
]
```
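
Each entry under `contextual_markers.entities` pairs an entity name with a message ID. Assuming those IDs are the sources a generated document is expected to cite (an interpretation of the format, not something this README states), they could be collected as in the sketch below; `expected_message_ids` is a hypothetical helper.

```python
def expected_message_ids(query):
    """Collect unique message IDs from a query's entity markers, in first-seen order."""
    entities = query.get("contextual_markers", {}).get("entities", [])
    ids = []
    for _entity_name, msg_id in entities:
        if msg_id not in ids:
            ids.append(msg_id)
    return ids


sample_query = {
    "query": "Generate a status report...",
    "contextual_markers": {
        "entities": [
            ["budget", "Msg_123"],
            ["deadline", "Msg_456"],
            ["budget owner", "Msg_123"],
        ]
    },
}
print(expected_message_ids(sample_query))
```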

## Example Results

```
BENCHMARK SUMMARY - FINANCE DOMAIN
==================================================
User Profile Accuracy    : 0.785
Intent Capture Accuracy  : 0.823  
Citation Accuracy        : 0.692
Document Quality         : 3.847
Overall Score            : 2.787

Total Queries Processed: 10
Results Directory: ./benchmark_results_finance
```

## Reproducing Published Results

### Available Benchmark Results

The repository includes reference results for multiple models:
- **GPT-4.1**: `gpt41_multi_domain_results.json`
- **GPT-4o**: `gpt4o_multi_domain_results.json` 
- **GPT-5**: `gpt_5_mini_multi_domain_results.json`
- **GPT-5-chat**: `gpt_5_chat_mini_multi_domain_results.json`
- **O4-mini**: `o4_mini_multi_domain_results.json`

### Step-by-Step Reproduction

#### 1. Setup Environment

```bash
# Install dependencies
pip install -r requirements.txt

# Configure Azure OpenAI (update with your endpoint)
export ENDPOINT_URL="https://your-endpoint.openai.azure.com/"

# Or update benchmark_config.json with your settings
```

#### 2. Generate New Benchmark Results

```bash
cd src/

# Run interactive benchmark for a specific model
python run_benchmark.py
# Choose option 3 (multi-domain comparison)
# Select evaluation mode and number of queries (40 per domain for full reproduction)
```

#### 3. Run Reference-Based Evaluation

```bash
# From the src/ directory
python results/eval_reference_based.py --model_dir GPT-5 --output_json my_gpt5_results.json
```

#### 4. Compare Results

Compare your results with published benchmarks:

**Expected GPT-5 Results (Finance Domain):**
- ROUGE-1: 39.54%
- ROUGE-2: 7.82%
- ROUGE-L: 13.01%
- BLEU: 3.13%
- METEOR: 23.55%

**Expected Overall Averages (All Domains):**
- GPT-4.1: ROUGE-1: 37.13%, METEOR: 22.26%
- GPT-4o: ROUGE-1: 35.14%, METEOR: 22.18%
- GPT-5: ROUGE-1: 40.61%, METEOR: 23.78%
- O4-mini: ROUGE-1: 28.19%, METEOR: 13.43%
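
The comparison can be scripted. The sketch below checks a set of measured metrics against the GPT-5 Finance figures listed above. The metric key names and the one-point tolerance are assumptions chosen for illustration; the actual keys in the output JSON of `eval_reference_based.py` may differ.

```python
# Published GPT-5 Finance figures from this README, in percent.
# Key names are hypothetical, chosen for this sketch.
EXPECTED_GPT5_FINANCE = {
    "rouge1": 39.54,
    "rouge2": 7.82,
    "rougeL": 13.01,
    "bleu": 3.13,
    "meteor": 23.55,
}


def compare_to_expected(actual, expected, tolerance=1.0):
    """Map each metric to (delta, within_tolerance), or None if it is missing."""
    report = {}
    for metric, target in expected.items():
        if metric not in actual:
            report[metric] = None
            continue
        delta = actual[metric] - target
        report[metric] = (round(delta, 2), abs(delta) <= tolerance)
    return report


# Hypothetical measured values, for illustration only.
report = compare_to_expected({"rouge1": 40.0, "bleu": 5.0}, EXPECTED_GPT5_FINANCE)
for metric, outcome in report.items():
    print(metric, outcome)
```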

### Key Configuration Parameters

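To reproduce the published results as closely as possible, use these settings in `benchmark_config.json` (with a nonzero temperature, scores can still vary slightly between runs):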

```json
{
  "benchmark_config": {
    "model_name": "gpt-5-chat",
    "temperature": 0.1,
    "max_queries_per_run": 40,
    "max_messages_per_context": 50,
    "evaluation_runs": 1
  },
  "data_paths": {
    "conversation_files": {
      "Finance": "../data/Finance/synthetic_domain_channels_Finance.json",
      "Healthcare": "../data/Healthcare/synthetic_domain_channels_Healthcare.json",
      "Technology": "../data/Technology/synthetic_domain_channels_Technology.json",
      "Manufacturing": "../data/Manufacturing/synthetic_domain_channels_Manufacturing.json"
    },
    "queries_file": {
      "Finance": "./synthetic_queries/generated_user_queries_Finance.json",
      "Healthcare": "./synthetic_queries/generated_user_queries_Healthcare.json",
      "Technology": "./synthetic_queries/generated_user_queries_Technology.json",
      "Manufacturing": "./synthetic_queries/generated_user_queries_Manufacturing.json"
    }
  }
}
```

### Golden Documents

The framework uses golden reference documents for evaluation:
- **Location**: `src/results/golden_documents/`
- **Format**: `{domain}_{query_id}.json` (e.g., `finance_001.json`)
- **Content**: High-quality reference documents for each domain and query
- **Total**: 160 documents (40 per domain)
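
The naming pattern can be expressed as a small helper. The sketch below assumes lowercase domain names and zero-padded three-digit query IDs starting at 1, inferred from the `finance_001.json` example; `golden_document_path` and `all_golden_paths` are hypothetical names, and the root path is written relative to `src/`.

```python
from pathlib import Path

DOMAINS = ("Finance", "Healthcare", "Manufacturing", "Technology")


def golden_document_path(domain, query_index, root="results/golden_documents"):
    """Build the expected golden document path for a domain and query number."""
    return Path(root) / f"{domain.lower()}_{query_index:03d}.json"


def all_golden_paths(queries_per_domain=40, root="results/golden_documents"):
    """List every expected golden document path across all domains."""
    return [
        golden_document_path(domain, index, root)
        for domain in DOMAINS
        for index in range(1, queries_per_domain + 1)
    ]


print(golden_document_path("Finance", 1).name)
print(len(all_golden_paths()))
```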

### Evaluation Modes

**End-to-end Evaluation** (Default):
- Uses model predictions for user profiles and intent
- Tests complete pipeline performance
- More realistic but potentially lower scores

**Ground Truth Evaluation**:
- Uses actual user profiles and intent labels
- Tests document generation quality in isolation
- Higher scores, isolates generation capability

### Troubleshooting Reproduction

**Different Results?**
1. Check model version and parameters
2. Verify temperature setting (should be 0.1)
3. Ensure you are using the same query files and golden documents
4. Check Azure OpenAI API version

**Missing Files?**
1. Verify all synthetic_queries files exist
2. Check data/ directory has all domain files
3. Ensure golden_documents directory is populated

**API Errors?**
1. Update Azure endpoint in config
2. Check API rate limits and quotas
3. Verify model access permissions

## Advanced Features

### Multi-Domain Comparison

```python
# Compare performance across domains
all_results = run_multi_domain_comparison([
    "Finance", "Technology", "Healthcare", "Manufacturing"
], max_queries=10)
```

### Custom Evaluation Criteria

Override default scoring methods:

```python
benchmark._calculate_profile_accuracy = custom_profile_scorer
benchmark._calculate_intent_accuracy = custom_intent_scorer
```

### Batch Processing

Process large query sets efficiently:

```python
# Process in batches to manage memory/API limits
for batch in query_batches:
    batch_results = benchmark.run_comprehensive_benchmark(
        conversation_file=conv_file,
        queries_file=batch_file,
        max_queries=50
    )
```
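
The loop above presumes the queries have already been split into batch files. One way to prepare them is sketched below: it splits a loaded query list into fixed-size chunks and writes each chunk to its own JSON file. `split_into_batches`, `write_batch_files`, and the `queries_batch_NNN.json` naming are hypothetical and not part of the framework.

```python
import json
from pathlib import Path


def split_into_batches(queries, batch_size):
    """Split a list of queries into consecutive chunks of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]


def write_batch_files(queries, batch_size, out_dir):
    """Write each batch to its own JSON file and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch in enumerate(split_into_batches(queries, batch_size)):
        path = out / f"queries_batch_{index:03d}.json"
        path.write_text(json.dumps(batch, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


print(split_into_batches(list(range(7)), 3))
```

Each returned path could then be passed as `queries_file` in the loop above.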

## Troubleshooting

### Common Issues

1. **File not found errors**:
   ```
   Error: [Errno 2] No such file or directory: './generated_user_queries_Finance.json'
   ```
   **Solution**: Update `benchmark_config.json` paths to use `./synthetic_queries/` prefix

2. **Azure authentication errors**: 
   - Ensure Azure CLI is logged in: `az login`
   - Or set environment variable: `export ENDPOINT_URL="https://your-endpoint.openai.azure.com/"`
   - Update `azure_endpoint` in `benchmark_config.json`

3. **Unicode/emoji errors on Windows**:
   ```
   UnicodeEncodeError: 'charmap' codec can't encode character
   ```
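   **Solution**: The scripts no longer contain emojis, so this error should not occur with the current version. If it still appears (for example, when printing message content), setting the `PYTHONIOENCODING=utf-8` environment variable is a common workaround.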

4. **Memory issues with large datasets**: Reduce `max_messages_per_context` in config

5. **API rate limits**: Add delays between requests or reduce batch sizes

6. **Missing evaluation dependencies**:
   ```
   ModuleNotFoundError: No module named 'evaluate'
   ```
   **Solution**: Install evaluation dependencies: `pip install evaluate nltk`

7. **Missing golden documents**: Ensure `golden_documents/` directory exists in `results/`
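
For the rate-limit issue (item 5), adding delays can be done with retries and exponential backoff. The following is a generic sketch rather than the framework's own mechanism: `call_with_backoff` is a hypothetical wrapper, and it retries on any exception for simplicity, whereas real code would catch only the client's rate-limit error.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays on failure.

    The delay doubles after each failed attempt (capped at max_delay) and
    gets up to 10% random jitter. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the rate-limit error in practice
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request():
        """Stand-in for an API call that fails twice before succeeding."""
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("rate limited")
        return "ok"

    print(call_with_backoff(flaky_request, base_delay=0.1))
```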

### Performance Optimization

- **Faster processing**: Use smaller context windows (`max_messages_per_context: 25`)
- **Memory efficiency**: Process domains separately rather than all at once
- **Consistency**: Keep `temperature: 0.1` for more repeatable results across runs
- **Parallel processing**: Run multiple domains simultaneously on different machines

### Validation Steps

1. **Check file structure**:
   ```bash
   # Run from the repository root
   ls src/synthetic_queries/  # Should show 4 JSON files
   ls data/Finance/          # Should show conversation JSON and GML files
   ```

2. **Test API connection**:
   ```python
   from document_generation import DocumentGenerationBenchmark
   benchmark = DocumentGenerationBenchmark()
   # Should initialize without errors
   ```

3. **Verify configuration**:
   ```bash
   # Run from the src/ directory, where benchmark_config.json lives
   python -c "import json; print(json.load(open('benchmark_config.json'))['data_paths'])"
   ```

## Contributing

When extending the framework:

1. Follow the dataclass patterns for new result types
2. Add corresponding analysis methods in `BenchmarkAnalyzer`
3. Update configuration schema for new parameters
4. Include visualization support for new metrics
5. Update documentation and examples

## License

This framework is provided as-is for research and evaluation purposes.
