# Embeddings Integration Guide

This guide explains how to use the embeddings functionality integrated into the prompt optimization framework.

## Overview

The embeddings integration supports two usage modes:
1. **Primary (integrated)**: Generate embeddings similarity heatmaps as part of existing experiments, alongside the current LLM results
2. **Secondary (standalone)**: Run embeddings analysis on its own, independent of any experiment

## Architecture

The embeddings functionality follows the same patterns as LLMs but uses a separate registry:

- **Embeddings Registry**: `embeddings_registry` (separate from `llm_registry`)
- **Base Interface**: `EmbeddingsInterface` (similar to `LLMInterface`)
- **Implementation**: `OpenAIEmbeddings` class, registered under the name `OpenAI_Embeddings` in the embeddings registry
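
To make the separation concrete, here is a minimal sketch of how two independent registries could coexist. The `Registry` class, its `get` method, and the error messages are illustrative assumptions, not the framework's actual code in `src/core/registry.py`; only the names `llm_registry`, `embeddings_registry`, and the `register` decorator come from this guide.

```python
from typing import Callable, Dict, Type


class Registry:
    """Minimal name-to-class registry (illustrative, not the framework's code)."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._classes: Dict[str, Type] = {}

    def register(self, name: str) -> Callable[[Type], Type]:
        """Return a class decorator that stores the class under `name`."""
        def decorator(cls: Type) -> Type:
            if name in self._classes:
                raise ValueError(f"{self.kind} '{name}' is already registered")
            self._classes[name] = cls
            return cls
        return decorator

    def get(self, name: str) -> Type:
        """Look up a registered class; list the known names on failure."""
        try:
            return self._classes[name]
        except KeyError:
            known = ", ".join(sorted(self._classes)) or "<none>"
            raise KeyError(f"Unknown {self.kind} '{name}'. Registered: {known}") from None


# Two independent instances keep LLMs and embeddings models apart.
llm_registry = Registry("LLM")
embeddings_registry = Registry("embeddings model")


@embeddings_registry.register("OpenAI_Embeddings")
class OpenAIEmbeddings:
    """Placeholder standing in for the real implementation."""
```

With this shape, a name registered in one registry is invisible to the other, so an embeddings model can never be picked up by accident where an LLM is expected.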

## Primary Usage: Integrated with Experiments

![Architecture Diagram with embeddings](images/architecture_2_with_embeddings.png)
![Data flow with embeddings](images/data_flow_2_with_embeddings.png)

### 1. Configuration

Add an `embeddings` section to your experiment configuration:

```yaml
# Example: config/experiments/tamper_detection_with_embeddings.yaml
embeddings:
  enabled: true
  default: OpenAI_Embeddings
  OpenAI_Embeddings:
    model: "bedrock-cohere-embed-eng-v3"
    output_format: ["png"]  # Can be ["png"], ["pdf"], or ["png", "pdf"]
```
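
As a rough illustration of how this section might be consumed, the sketch below resolves the `default` key to its settings block. The function name `resolve_embeddings_settings` and its error handling are hypothetical; the config is written as a plain dict mirroring the YAML so the example needs no YAML parser.

```python
from typing import Any, Dict, Optional, Tuple

# Plain-dict mirror of the YAML above.
experiment_config: Dict[str, Any] = {
    "embeddings": {
        "enabled": True,
        "default": "OpenAI_Embeddings",
        "OpenAI_Embeddings": {
            "model": "bedrock-cohere-embed-eng-v3",
            "output_format": ["png"],
        },
    }
}


def resolve_embeddings_settings(
    config: Dict[str, Any],
) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (registered_name, settings), or None when embeddings are off."""
    section = config.get("embeddings") or {}
    if not section.get("enabled", False):
        return None
    name = section.get("default")
    if name is None:
        raise ValueError("embeddings.default must name a registered model")
    if name not in section:
        raise ValueError(f"embeddings.{name} settings block is missing")
    return name, dict(section[name])


resolved = resolve_embeddings_settings(experiment_config)
if resolved is not None:
    name, settings = resolved
    print(name, settings["model"])
```

Returning `None` for a missing or disabled section is one way to leave configs without an `embeddings` key working unchanged.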

### 2. Running Experiments with Embeddings

```bash
python examples/tamper_detection_experiment.py --config tamper_detection_with_embeddings
```

### 3. Expected Outputs

When embeddings are enabled, you'll get all the normal experiment outputs **plus**:
- Embeddings similarity heatmap (PNG/PDF)
- Similarity matrix (CSV and JSON)
- Similarity metrics in the experiment logs
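
For readers unfamiliar with what the similarity matrix contains, the following standard-library sketch computes pairwise cosine similarity and writes it as CSV and JSON. The file names, the CSV layout, and the JSON keys are assumptions for illustration; the framework's actual output files may be structured differently, and the toy vectors stand in for real embeddings.

```python
import csv
import json
import math
import tempfile
from pathlib import Path
from typing import List, Sequence


def cosine_similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarity; entry [i][j] compares vectors i and j."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    if any(n == 0.0 for n in norms):
        raise ValueError("cosine similarity is undefined for a zero vector")
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def save_similarity(matrix, labels, out_dir: Path, prefix: str) -> None:
    """Write the matrix as <prefix>.csv (labelled grid) and <prefix>.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{prefix}.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([""] + list(labels))
        for label, row in zip(labels, matrix):
            writer.writerow([label] + [f"{value:.4f}" for value in row])
    with open(out_dir / f"{prefix}.json", "w") as fh:
        json.dump({"labels": list(labels), "matrix": matrix}, fh, indent=2)


# Toy 3-dimensional vectors stand in for real embeddings.
labels = ["Base Prompt", "Optimized Prompt"]
matrix = cosine_similarity_matrix([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
save_similarity(matrix, labels, Path(tempfile.mkdtemp()), "similarity")
```

The matrix is symmetric with ones on the diagonal, which is what the heatmap visualizes.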

### 4. Experiment Pipeline Steps

The embeddings integration adds these steps to the experiment pipeline:
1. `init_embeddings` - Initialize the embeddings model
2. `generate_embeddings_analysis` - Extract prompts and generate similarity analysis
3. `save_embeddings_results` - Save heatmaps and similarity data
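
The three steps above can be pictured as functions that run in order over a shared context. This is only a sketch of the ordering: the step bodies are stand-ins, and the `Context` dict, `EMBEDDINGS_STEPS`, and `run_embeddings_steps` are hypothetical names, not the actual code in `tamper_detection_experiment.py`.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


def init_embeddings(ctx: Context) -> None:
    # Stand-in for instantiating the configured embeddings model.
    ctx["model"] = ctx["config"]["embeddings"]["default"]


def generate_embeddings_analysis(ctx: Context) -> None:
    # Stand-in for extracting prompts and computing similarities.
    ctx["analysis"] = {"prompts": list(ctx["prompts"]), "model": ctx["model"]}


def save_embeddings_results(ctx: Context) -> None:
    # Stand-in for writing heatmaps and similarity data to disk.
    ctx["saved"] = True


EMBEDDINGS_STEPS: List[Callable[[Context], None]] = [
    init_embeddings,
    generate_embeddings_analysis,
    save_embeddings_results,
]


def run_embeddings_steps(ctx: Context) -> List[str]:
    """Run the steps in order when enabled; return the names that ran."""
    if not ctx["config"].get("embeddings", {}).get("enabled", False):
        return []  # experiments without embeddings are left untouched
    executed = []
    for step in EMBEDDINGS_STEPS:
        step(ctx)
        executed.append(step.__name__)
    return executed


ctx = {
    "config": {"embeddings": {"enabled": True, "default": "OpenAI_Embeddings"}},
    "prompts": ["Original prompt", "Optimized prompt"],
}
print(run_embeddings_steps(ctx))
```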

## Secondary Usage: Standalone Analysis

### Command Line Interface

```bash
python examples/embeddings_analysis.py \
    --texts "First prompt text" "Second prompt text" "Third prompt text" \
    --labels "Base Prompt" "Optimized Prompt" "Alternative Prompt" \
    --model "bedrock-cohere-embed-eng-v3" \
    --output "my_analysis" \
    --output-dir "./my_output" \
    --format "png" \
    --verbose
```

### Parameters

- `--texts`: List of texts to analyze (required)
- `--labels`: Optional labels for the texts
- `--model`: Embedding model to use (default: bedrock-cohere-embed-eng-v3)
- `--output`: Output filename prefix (default: embeddings_analysis)
- `--output-dir`: Output directory (default: ./output)
- `--format`: Output format - "png", "pdf", or "both"
- `--verbose`: Enable verbose logging
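
The parameter list maps naturally onto `argparse`. The sketch below is an approximation of how such a parser could look, not the actual contents of `embeddings_analysis.py`. Two points are assumptions: this guide states no default for `--format`, so `"png"` is used here, and the check that `--labels` matches `--texts` in length is a guess at sensible behavior.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Standalone embeddings analysis")
    parser.add_argument("--texts", nargs="+", required=True, help="Texts to analyze")
    parser.add_argument("--labels", nargs="+", default=None, help="Optional labels")
    parser.add_argument("--model", default="bedrock-cohere-embed-eng-v3")
    parser.add_argument("--output", default="embeddings_analysis")
    parser.add_argument("--output-dir", default="./output")
    # Assumption: the guide does not state a default for --format.
    parser.add_argument("--format", choices=["png", "pdf", "both"], default="png")
    parser.add_argument("--verbose", action="store_true")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: labels, when given, must pair one-to-one with texts.
    if args.labels is not None and len(args.labels) != len(args.texts):
        parser.error("--labels must have the same number of entries as --texts")
    # Expand "both" into the list form used by the YAML `output_format` key.
    args.formats = ["png", "pdf"] if args.format == "both" else [args.format]
    return args


args = parse_args(["--texts", "Original prompt", "Optimized prompt", "--format", "both"])
print(args.formats, args.output_dir)
```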

### Example Usage

```bash
# Basic usage
python examples/embeddings_analysis.py \
    --texts "You are an expert in detecting tampering" "You are a specialist in identifying tampering"

# With custom labels and output
python examples/embeddings_analysis.py \
    --texts "Original prompt" "Optimized prompt" \
    --labels "Base" "Optimized" \
    --output "prompt_comparison" \
    --format "both"
```

## Implementation Details

### Registry Pattern

The embeddings functionality uses the same registry pattern as LLMs:

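```python
from src.core.registry import embeddings_registry
# EmbeddingsInterface must also be imported; its module path is not shown in this guide.

# Register an embeddings model under the name referenced in the config
@embeddings_registry.register("OpenAI_Embeddings")
class OpenAIEmbeddings(EmbeddingsInterface):
    ...  # Implementation
```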

### Key Classes

1. **`OpenAIEmbeddings`**: Main embeddings implementation
   - Registered with `embeddings_registry`
   - Implements `EmbeddingsInterface`
   - Generates embeddings and similarity heatmaps

2. **`EmbeddingsAnalyzer`**: Integration wrapper
   - Extracts prompts from experiment results
   - Manages embeddings analysis workflow
   - Handles file saving and metrics calculation
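
To show what implementing the interface could involve, here is a hedged sketch with a toy backend. The method name `embed` and its signature are assumptions, since this guide does not list the methods of `EmbeddingsInterface`, and `CharCountEmbeddings` is a made-up class useful only for offline experiments.

```python
from abc import ABC, abstractmethod
from typing import List


class EmbeddingsInterface(ABC):
    """Hypothetical shape of the base interface; real method names may differ."""

    @abstractmethod
    def embed(self, texts: List[str]) -> List[List[float]]:
        """Return one vector per input text."""


class CharCountEmbeddings(EmbeddingsInterface):
    """Toy backend: one dimension per letter a-z, holding that letter's count."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            counts = [0.0] * 26
            for ch in text.lower():
                if "a" <= ch <= "z":
                    counts[ord(ch) - ord("a")] += 1.0
            vectors.append(counts)
        return vectors


model = CharCountEmbeddings()
vectors = model.embed(["abc", "aab"])
print(vectors[0][:3], vectors[1][:3])
```

A deterministic backend like this can be handy for testing the analysis workflow without calling a remote model.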

### Configuration Options

```yaml
embeddings:
  enabled: true                    # Enable/disable embeddings analysis
  default: OpenAI_Embeddings      # Default embeddings model
  OpenAI_Embeddings:
    model: "bedrock-cohere-embed-eng-v3"  # Embedding model name
    output_format: ["png"]        # Output formats for heatmaps
```
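
A small validation pass can catch typos in these options before an experiment starts. The sketch below is one possible check, written against a plain dict; the function name is hypothetical, and treating `output_format` as a required, non-empty list limited to `png` and `pdf` is an assumption based on the values shown in this guide.

```python
from typing import Any, Dict, List

ALLOWED_FORMATS = {"png", "pdf"}


def validate_embeddings_section(section: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    if not isinstance(section.get("enabled"), bool):
        problems.append("'enabled' must be true or false")
    name = section.get("default")
    if not isinstance(name, str):
        problems.append("'default' must name an embeddings model")
        return problems
    settings = section.get(name)
    if not isinstance(settings, dict):
        problems.append(f"missing settings block for '{name}'")
        return problems
    if not settings.get("model"):
        problems.append(f"'{name}.model' must be a non-empty string")
    formats = settings.get("output_format", [])
    if not isinstance(formats, list) or not formats:
        problems.append(f"'{name}.output_format' must be a non-empty list")
    else:
        unknown = sorted(set(formats) - ALLOWED_FORMATS)
        if unknown:
            problems.append(f"unsupported output formats: {unknown}")
    return problems


section = {
    "enabled": True,
    "default": "OpenAI_Embeddings",
    "OpenAI_Embeddings": {
        "model": "bedrock-cohere-embed-eng-v3",
        "output_format": ["png", "svg"],
    },
}
print(validate_embeddings_section(section))
```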

## File Structure

```
prompt_optimisation/
├── src/
│   ├── core/
│   │   └── registry.py           # Added EmbeddingsRegistry
│   └── llm/
│       ├── openai_embeddings.py # Fixed to use embeddings_registry
│       └── embeddings_analyzer.py # New integration wrapper
├── examples/
│   ├── tamper_detection_experiment.py # Enhanced with embeddings steps
│   └── embeddings_analysis.py   # New standalone script
└── config/experiments/
    └── tamper_detection_with_embeddings.yaml # New config example
```

## Benefits

1. **Framework Consistency**: Uses the same patterns as LLM integration
2. **Backward Compatibility**: Existing experiments work unchanged
3. **Flexible Usage**: Both integrated and standalone capabilities
4. **Proper Separation**: Embeddings models separate from LLMs
5. **Extensible**: Easy to add more embedding models

## Troubleshooting

### Common Issues

1. **Import Errors**: Make sure `embeddings_registry` is imported in your scripts
2. **Configuration**: Ensure `embeddings.enabled: true` in your config
3. **Model Names**: Use the registered model name (`OpenAI_Embeddings`), not the class name (`OpenAIEmbeddings`)
4. **Dependencies**: Ensure required packages are installed (scikit-learn, seaborn, matplotlib)
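
A quick way to check the dependency item is to ask Python whether each package is importable. This sketch uses only the standard library; the helper name `missing_packages` is hypothetical, and note that `scikit-learn` is imported as `sklearn`.

```python
import importlib.util
from typing import Dict, List

# pip package name -> importable top-level module name
REQUIRED: Dict[str, str] = {
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```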

### Debugging

Enable verbose logging to see detailed information:

```bash
python examples/tamper_detection_experiment.py --config your_config --verbose
```

## Future Extensions

The embeddings framework can be easily extended to support:
- Additional embedding models (Hugging Face, Cohere, etc.)
- Different similarity metrics
- Advanced visualization options
- Batch processing capabilities