# MoVE: Mixture-of-Vocabulary-Experts

A PyTorch implementation of Mixture-of-Vocabulary-Experts (MoVE) BERT for learning dense text embeddings, with support for multiple MoE routing strategies, including hash-based and top-k routing.

## Architecture

The model consists of:

1. **MoEBertEncoder**: Core encoder with configurable MoE layers
2. **CustomEmbeddingEncoder**: Adds embedding projection and normalization
3. **SiameseEncoder**: Dual encoder for query-passage training

## Installation

```bash
# Install dependencies
pip install "torch<2.5" accelerate transformers pandas scipy tokenmonster mteb==1.24.0
```

## Usage

### Training

Run `accelerate config` to set up distributed training, then launch `main.py` directly or through the provided training script:

```bash
# Basic training command
accelerate launch --config_file config.yaml main.py \
    --batch_size 256 \
    --lr 5e-5 \
    --embedding_dim 768 \
    --num_experts 2000 \
    --moe_type hash \
    --hash_list_path /path/to/hash_lists/balance_hash_bucket_2000_200k.pkl

# Or use the training script
bash train.sh
```

### Key Training Parameters

- `--num_experts`: Number of experts in each MoE layer
- `--intermediate_size_expert`: Hidden dimension of each expert
- `--moe_type`: Routing strategy (e.g. `hash`)
- `--hash_list_path`: Path to the pre-computed balanced hash assignment file (used with `--moe_type hash`)

### Evaluation

Evaluate trained models on MTEB tasks:

```bash
python mteb_eval.py \
    --load_model /path/to/model.pt \
    --batch_size 1024 \
    --output_dir /path/to/results

# Or use evaluation script
bash eval.sh
```

### Result Processing

Parse evaluation results into CSV format:

```bash
python result_parser.py \
    --folder_path /path/to/results \
    --output_csv_path results.csv
```
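
As a rough illustration of what this step involves, the sketch below collects the main score of every task result under a folder and writes them to a CSV. It is not the code of `result_parser.py`: the function name `collect_main_scores` and the CSV columns are hypothetical, and it assumes each result file is a JSON object with a `task_name` key and a `scores` mapping from split name to a list of entries that carry `main_score` and `hf_subset`.

```python
import csv
import json
from pathlib import Path


def collect_main_scores(folder_path, output_csv_path, split="test"):
    """Gather per-task main scores from JSON result files into one CSV.

    Assumed (not confirmed by this README) result layout:
    {"task_name": ..., "scores": {"test": [{"main_score": ..., "hf_subset": ...}]}}
    """
    rows = []
    for json_file in sorted(Path(folder_path).rglob("*.json")):
        with open(json_file) as f:
            result = json.load(f)
        # Skip JSON files that are not per-task results (e.g. model metadata).
        if not isinstance(result, dict):
            continue
        if "scores" not in result or "task_name" not in result:
            continue
        for entry in result["scores"].get(split, []):
            rows.append({
                "task_name": result["task_name"],
                "split": split,
                "hf_subset": entry.get("hf_subset", ""),
                "main_score": entry.get("main_score"),
            })

    fieldnames = ["task_name", "split", "hf_subset", "main_score"]
    with open(output_csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `collect_main_scores("/path/to/results", "results.csv")` would then produce one CSV row per task and subset for the chosen split.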

## Configuration

### Model Configuration

Key configuration parameters in `utils.py`:

```python
# Model Architecture
embedding_dim: 768            # Output embedding dimension
hidden_size: 768              # Hidden size of BERT layers
num_hidden_layers: 6          # Number of transformer layers
num_attention_heads: 12       # Number of attention heads

# MoE Configuration
num_experts: 2000             # Number of experts
num_sparse_layers: 3          # Number of MoE layers
moe_type: "hash"              # Routing strategy
topk: 1                       # Top-k experts (for topk routing)
intermediate_size_expert: 43  # Expert hidden dimension
```
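
To get a feel for the scale these settings imply, the arithmetic below estimates how many parameters the experts hold. It assumes each expert is a two-layer feed-forward block (`hidden_size` to `intermediate_size_expert` and back) with biases, which this README does not confirm; `expert_ffn_params` is a hypothetical helper used only for this illustration.

```python
def expert_ffn_params(hidden_size: int, intermediate_size: int) -> int:
    """Parameter count of one two-layer FFN expert with biases (assumed structure)."""
    up = hidden_size * intermediate_size + intermediate_size    # weights + bias
    down = intermediate_size * hidden_size + hidden_size        # weights + bias
    return up + down


# Values taken from the configuration listed above.
hidden_size = 768
intermediate_size_expert = 43
num_experts = 2000
num_sparse_layers = 3

per_expert = expert_ffn_params(hidden_size, intermediate_size_expert)
per_layer = per_expert * num_experts
total_sparse = per_layer * num_sparse_layers

print(f"per expert:     {per_expert:,}")
print(f"per MoE layer:  {per_layer:,}")
print(f"all MoE layers: {total_sparse:,}")
```

If each token is routed to a single expert per MoE layer, as with hash routing or `topk: 1`, the per-token compute tracks the per-expert figure, not the per-layer total.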

### Dataset Configuration

The model expects memory-mapped datasets with:
- Query and passage token IDs
- Attention masks
- Query-passage relevance labels (qrels)

Dataset info files specify:
- Maximum number of queries/passages
- Sequence lengths
- File paths
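
The sketch below shows one way such a memory-mapped token-ID file could be written and read back with `numpy.memmap`. The on-disk layout, the `info` keys, the file name, and the `int32` dtype are all assumptions made for illustration; the actual dataset format used by this repository may differ.

```python
import os
import tempfile

import numpy as np

# Hypothetical dataset info; the real info files may use other keys.
info = {"max_queries": 4, "query_seq_len": 8, "dtype": "int32"}

tmp_dir = tempfile.mkdtemp()
ids_path = os.path.join(tmp_dir, "query_input_ids.bin")  # hypothetical file name

# Write a small token-ID array to disk as a flat binary file.
shape = (info["max_queries"], info["query_seq_len"])
writer = np.memmap(ids_path, dtype=info["dtype"], mode="w+", shape=shape)
writer[:] = 0                      # 0 stands in for the padding ID here
writer[0, :3] = [101, 2054, 102]   # one toy query of three tokens
writer.flush()
del writer

# Reopen read-only: data is read from disk lazily, as rows are indexed.
query_ids = np.memmap(ids_path, dtype=info["dtype"], mode="r", shape=shape)

# The mask is derived here for brevity; the datasets described above
# store attention masks alongside the token IDs.
attention_mask = (query_ids != 0).astype(np.int64)

print(query_ids[0])
print(attention_mask[0])
```

Because the shape and dtype are not stored in a raw binary file, they have to come from somewhere else, which is the role the dataset info files play in the description above.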

## Hash-Based Routing

For hash-based routing, pre-compute balanced hash assignments and pass the resulting file to `--hash_list_path`:

```bash
python create_hash_balance.py \
    --num_experts 2000 \
    --vocab_size 30003 \
    --output_path balance_hash_bucket_2000_200k.pkl
```
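
The idea of a balanced token-to-expert table can be sketched as follows. This is not the algorithm of `create_hash_balance.py`, which this README does not describe: the greedy least-loaded strategy, the use of token frequencies, and the name `build_balanced_assignment` are assumptions chosen for illustration.

```python
import heapq
import pickle


def build_balanced_assignment(token_freqs, num_experts):
    """Greedily map each token ID to the expert with the smallest load so far.

    token_freqs[i] is the (assumed) corpus frequency of token ID i.
    Returns a list whose entry i is the expert index for token ID i.
    """
    # Min-heap of (accumulated_frequency, expert_index).
    heap = [(0, e) for e in range(num_experts)]
    heapq.heapify(heap)
    assignment = [0] * len(token_freqs)
    # Place frequent tokens first so that rare tokens even out the loads.
    order = sorted(range(len(token_freqs)), key=lambda t: -token_freqs[t])
    for token_id in order:
        load, expert = heapq.heappop(heap)
        assignment[token_id] = expert
        heapq.heappush(heap, (load + token_freqs[token_id], expert))
    return assignment


# Toy example: 12 tokens with skewed frequencies, 4 experts.
token_freqs = [120, 60, 40, 30, 24, 20, 17, 15, 13, 12, 11, 10]
token_to_expert = build_balanced_assignment(token_freqs, num_experts=4)

loads = [0] * 4
for token_id, expert in enumerate(token_to_expert):
    loads[expert] += token_freqs[token_id]
print("token -> expert:", token_to_expert)
print("per-expert load:", loads)

# Routing is then a table lookup on the input token IDs.
input_ids = [0, 5, 11, 5]
print("experts for input:", [token_to_expert[t] for t in input_ids])

# The table can be serialized with pickle, like the .pkl file above.
restored = pickle.loads(pickle.dumps(token_to_expert))
```

Since the table depends only on token IDs, the same token always reaches the same expert, and no router parameters need to be learned for this routing mode.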