# Implicit Embeddings Benchmark

## Overview
This benchmark evaluates how well embedding models capture implicit semantics: nuanced meaning that goes beyond explicit similarity and that existing benchmarks such as MTEB do not cover well.

## Requirements
- Python 3.7+
- sentence-transformers
- scikit-learn
- pandas
- numpy

Install the required packages:
```bash
pip install sentence-transformers scikit-learn pandas numpy
```

## Datasets
The benchmark includes multiple datasets:
- **PUB** (Pragmatics Understanding Benchmark): Contains multiple tasks in JSONL format
- **P-Stance**: A large dataset for stance detection in the political domain
- **Social Bias Frames**: A dataset for detecting various types of bias in text
- **Implicit Hate Speech**: A dataset for understanding implicit hate speech
- **Article-Bias-Prediction**: A dataset for detecting political bias in news articles

## Running the Benchmark

### Basic Usage
Run the benchmark on a specific embedding model:

```bash
python run_benchmark.py --model all-MiniLM-L6-v2
```

This evaluates the model on the default datasets (`pub`, `pstance`, `sbic`) and saves the results to a directory named after the model.

### Advanced Usage
Specify which datasets to evaluate:

```bash
python run_benchmark.py --model all-MiniLM-L6-v2 --datasets pub pstance
```

Specify a custom output directory:

```bash
python run_benchmark.py --model all-MiniLM-L6-v2 --output ./custom_results
```

Force CPU mode (useful if you have GPU issues):

```bash
python run_benchmark.py --model all-MiniLM-L6-v2 --cpu
```

Skip datasets that have already been evaluated (useful for resuming interrupted runs):

```bash
python run_benchmark.py --model all-MiniLM-L6-v2 --skip-existing
```

## Understanding Results
Results are saved in the output directory with the following structure:

```
results/
└── all-MiniLM-L6-v2/
    ├── pub_results.csv
    ├── pub_results.json
    ├── pstance_results.csv
    ├── pstance_results.json
    └── summary.json
```

Each dataset has its own CSV and JSON file with detailed metrics:
- Accuracy
- F1 score (macro, micro, weighted)
- Precision and recall
- Evaluation time

The `summary.json` file contains the aggregated results from all datasets.
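
To inspect results programmatically, a small loader along these lines may help. This is a sketch, not part of the benchmark: `load_results` is a hypothetical helper, and it relies only on the file naming shown above (`<dataset>_results.json` and `summary.json`). The keys inside each JSON file are not documented here, so the function returns the parsed content as-is.

```python
import json
from pathlib import Path


def load_results(model_dir):
    """Load every `*_results.json` file and `summary.json` from one model's output directory.

    Hypothetical helper: it assumes only the file names shown in the tree above.
    """
    model_dir = Path(model_dir)
    suffix = "_results.json"
    results = {}
    for path in sorted(model_dir.glob("*" + suffix)):
        dataset = path.name[: -len(suffix)]  # "pub_results.json" -> "pub"
        with path.open(encoding="utf-8") as f:
            results[dataset] = json.load(f)

    summary = None
    summary_path = model_dir / "summary.json"
    if summary_path.exists():
        with summary_path.open(encoding="utf-8") as f:
            summary = json.load(f)
    return results, summary
```

For example, `per_dataset, summary = load_results("results/all-MiniLM-L6-v2")` returns one entry per evaluated dataset plus the aggregated summary, or `None` for the summary if the file is missing.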

## Evaluation Approach
The benchmark uses different approaches for evaluating embedding models based on task type:

### Classification Tasks
For standard classification tasks:
1. Load the embedding model using sentence-transformers
2. Encode text samples using the model
3. Train a logistic regression classifier on the embeddings
4. Evaluate performance on the test set
5. Calculate and report metrics (accuracy, F1 scores, etc.)
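
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `run_benchmark.py`: `evaluate_classification` is a hypothetical function name, the classifier settings (`max_iter=1000`) are a guess, and random clusters stand in for real embeddings so the example runs without downloading a model. In a real run the arrays would come from `SentenceTransformer(model_name).encode(texts)`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score


def evaluate_classification(train_emb, train_labels, test_emb, test_labels):
    """Fit logistic regression on training embeddings and score it on test embeddings."""
    clf = LogisticRegression(max_iter=1000)  # assumed setting, not taken from the benchmark
    clf.fit(train_emb, train_labels)
    preds = clf.predict(test_emb)
    return {
        "accuracy": accuracy_score(test_labels, preds),
        "f1_macro": f1_score(test_labels, preds, average="macro"),
        "f1_micro": f1_score(test_labels, preds, average="micro"),
        "f1_weighted": f1_score(test_labels, preds, average="weighted"),
    }


# Synthetic stand-in for embeddings: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 8)), rng.normal(2.0, 0.5, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then hold out 20% of the samples as the test set.
idx = rng.permutation(len(y))
train, test = idx[:80], idx[80:]

metrics = evaluate_classification(X[train], y[train], X[test], y[test])
print(metrics)
```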

### Zero-Shot Classification Tasks
For zero-shot classification tasks (like implicature recovery, figurative language understanding):
1. Load the embedding model using sentence-transformers
2. For each test sample:
   - Encode the input text
   - Encode all possible option texts
   - Compute cosine similarity between the input embedding and each option embedding
   - Predict the option with the highest similarity
3. Calculate and report accuracy
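
A minimal sketch of this procedure, assuming the embeddings are already available as NumPy arrays with non-zero norm. The function names (`cosine_similarity`, `predict_option`, `zero_shot_accuracy`) are hypothetical, and the toy 2-D vectors stand in for what `model.encode(...)` would return.

```python
import numpy as np


def cosine_similarity(vec, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    vec = np.asarray(vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    return (matrix @ vec) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))


def predict_option(input_emb, option_embs):
    """Return the index of the option most similar to the input."""
    return int(np.argmax(cosine_similarity(input_emb, option_embs)))


def zero_shot_accuracy(samples):
    """`samples` is an iterable of (input_emb, option_embs, correct_index) triples."""
    samples = list(samples)
    hits = sum(predict_option(inp, opts) == gold for inp, opts, gold in samples)
    return hits / len(samples)


# Toy vectors standing in for model embeddings; the third sample is
# constructed so that the nearest option is NOT the gold one.
samples = [
    (np.array([1.0, 0.1]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0),
    (np.array([0.2, 1.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 1),
    (np.array([1.0, 0.0]), np.array([[0.0, 1.0], [-1.0, 0.0]]), 1),
]
print(zero_shot_accuracy(samples))
```

No classifier is trained here, so the score depends entirely on the geometry of the embedding space.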

### Pair Classification Tasks
For pair classification tasks (like agreement detection):
1. Load the embedding model using sentence-transformers
2. For each pair of texts:
   - Encode both texts
   - Compute cosine similarity between the embeddings
3. Find the optimal threshold that maximizes accuracy:
   - Test multiple thresholds from -1 to 1
   - For each threshold, convert similarities to binary predictions
   - Calculate accuracy for each threshold
   - Select the threshold with the highest accuracy
4. Report the optimal threshold and the corresponding metrics
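
The threshold search can be sketched as below. Several details are assumptions rather than documented behavior: the number of candidate thresholds (`num_thresholds=201`, a step of 0.01), the convention that a similarity at or above the threshold counts as the positive class, and the function names themselves.

```python
import numpy as np


def pairwise_cosine(emb_a, emb_b):
    """Row-wise cosine similarity between two equally shaped embedding matrices."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    dots = np.sum(emb_a * emb_b, axis=1)
    return dots / (np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))


def best_threshold(similarities, labels, num_thresholds=201):
    """Sweep thresholds in [-1, 1] and return (threshold, accuracy) for the best one."""
    similarities = np.asarray(similarities, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_acc = -1.0, -1.0
    for t in np.linspace(-1.0, 1.0, num_thresholds):
        preds = (similarities >= t).astype(int)  # assumed: >= threshold means label 1
        acc = float(np.mean(preds == labels))
        if acc > best_acc:  # strict comparison keeps the lowest of any tied thresholds
            best_t, best_acc = float(t), acc
    return best_t, best_acc


# Toy similarities and binary labels in place of real text pairs.
sims = np.array([0.9, 0.8, 0.3, 0.1, -0.2])
labels = np.array([1, 1, 0, 0, 0])
threshold, accuracy = best_threshold(sims, labels)
print(threshold, accuracy)
```

Because the threshold is chosen on the same pairs it is scored on, the reported accuracy is an upper bound for this decision rule on that data.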

Together, these approaches assess how well an embedding model captures the semantic distinctions each task requires.

## Adding New Models
To evaluate a new model, simply specify its name or path when running the benchmark:

```bash
python run_benchmark.py --model <model_name_or_path>
```

The model must be compatible with the sentence-transformers library.

## Setup

1. Install dependencies, using either pip or conda.

Using pip:
```bash
pip install -r requirements.txt
```

Using conda:
```bash
# Create a new conda environment
conda create -n implicit_embeddings python=3.11

# Activate the environment
conda activate implicit_embeddings

# Install dependencies
pip install -r requirements.txt

# Optional: deactivate the environment when done
# conda deactivate
```

2. Download datasets:
```bash
# Download PUB dataset
python scripts/download_pub.py

# Download P-Stance dataset
python scripts/download_pstance.py

# Download Social Bias Frames dataset
python scripts/download_sbic.py

# Download Implicit Hate Speech dataset
python scripts/download_implicit_hate.py
```

## License

[MIT License](LICENSE)