# VecLink

Cross-embedding vector correspondence via iterative geometric embedding hashing. Given two embedding sets, produced by different models over partially overlapping portions of the same data, VecLink identifies which vectors correspond to the same underlying entity using only a small set of seed anchors.

## Installation

Requires Python 3.9–3.10.

```bash
# Install dependencies
uv sync

# PyTorch Geometric extensions (must be installed separately).
# The wheel index below targets torch 2.1.0 with CUDA 11.8.
uv pip install torch-cluster torch-scatter torch-sparse torch-spline-conv \
  -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
```

## Quick Start

```bash
uv run veclink.py \
  --dataset scifact \
  --emb1 mistral \
  --emb2 openai \
  --overlap_ratio 0.3 \
  --n_seeds 15 \
  --seed 42 \
  --use_bernoulli_trials
```

## Embeddings

Place embedding files in the `embeddings/` directory as NumPy `.npy` files, named according to this pattern:

```
corpus_embeddings_{model}_{dataset}.npy
```

For example: `corpus_embeddings_mistral_scifact.npy`, `corpus_embeddings_openai_scifact.npy`.
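
Before launching a run, it can help to confirm that both files exist under the expected names. The sketch below is illustrative only: `embedding_path` and `missing_embeddings` are hypothetical helpers, not part of VecLink, and they assume the `embeddings/` directory is resolved relative to the current working directory.

```python
from pathlib import Path


def embedding_path(model: str, dataset: str, root: str = "embeddings") -> Path:
    """Build the expected file path following the naming pattern above."""
    return Path(root) / f"corpus_embeddings_{model}_{dataset}.npy"


def missing_embeddings(dataset: str, models: list[str], root: str = "embeddings") -> list[Path]:
    """Return the expected embedding files that are not present on disk."""
    candidates = [embedding_path(model, dataset, root) for model in models]
    return [path for path in candidates if not path.is_file()]
```

For instance, `missing_embeddings("scifact", ["mistral", "openai"])` returns an empty list when both example files above are in place.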

## Key Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--dataset` | `scifact` | Dataset name |
| `--emb1` / `--emb2` | `mistral` / `openai` | Embedding model names |
| `--overlap_ratio` | `0.3` | Fraction of data shared between the two sets |
| `--n_seeds` | `None` | Number of seed anchor pairs |
| `--use_bernoulli_trials` | `False` | Use posterior-based ensemble selection |
| `--max_iter` | `100` | Maximum refinement iterations |
| `--seed` | `None` | Random seed for reproducibility |
| `--use_gpu` | `True` | Enable GPU acceleration |
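
To make `--overlap_ratio`, `--n_seeds`, and `--seed` more concrete, the following sketch shows one plausible way a partially overlapping split with seed anchors could be constructed. This is an assumption for illustration, not VecLink's actual sampling code: the function name `make_overlapping_split`, the even division of non-shared items, and the reading of `overlap_ratio` as a fraction of the full corpus are all hypothetical.

```python
import random


def make_overlapping_split(n_items, overlap_ratio, n_seeds, seed=None):
    """Split item ids into two sets that share a fraction of the corpus.

    Assumed interpretation (not confirmed by VecLink): `overlap_ratio` is the
    fraction of all items present in both sets, and the remaining items are
    divided evenly between the two sets.
    """
    rng = random.Random(seed)  # fixed seed gives a reproducible split
    ids = list(range(n_items))
    rng.shuffle(ids)

    n_shared = round(overlap_ratio * n_items)
    shared, rest = ids[:n_shared], ids[n_shared:]
    half = len(rest) // 2

    set_a = shared + rest[:half]
    set_b = shared + rest[half:]

    # Seed anchors are known correspondences drawn from the shared items.
    seeds = rng.sample(shared, n_seeds)
    return set_a, set_b, shared, seeds
```

Under these assumptions, the task is to recover the rest of `shared` given only `seeds` and the two embedding sets.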

## Supported Datasets

BEIR benchmark datasets: `scifact`, `scidocs`, `fiqa`, `nfcorpus`, `arguana`.
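
To repeat the Quick Start configuration across every supported dataset, a small driver along these lines could be used. It is a sketch, not part of the repository: `build_command` and `run_all` are hypothetical names, the command uses only the flags shown in the Quick Start, and it assumes the matching embedding files exist for each dataset.

```python
import subprocess

# Dataset names as listed in this README.
DATASETS = ["scifact", "scidocs", "fiqa", "nfcorpus", "arguana"]


def build_command(dataset, emb1="mistral", emb2="openai",
                  overlap_ratio=0.3, n_seeds=15, seed=42):
    """Assemble the Quick Start command for one dataset as an argument list."""
    return [
        "uv", "run", "veclink.py",
        "--dataset", dataset,
        "--emb1", emb1,
        "--emb2", emb2,
        "--overlap_ratio", str(overlap_ratio),
        "--n_seeds", str(n_seeds),
        "--seed", str(seed),
        "--use_bernoulli_trials",
    ]


def run_all(datasets=DATASETS):
    """Run VecLink once per dataset, stopping at the first failure."""
    for dataset in datasets:
        subprocess.run(build_command(dataset), check=True)
```

Calling `run_all()` from the repository root runs the datasets sequentially. Passing the command as a list avoids shell quoting issues.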
