# Knowledge Graph Embedding (KGE) Training Library

A library for training Knowledge Graph Embeddings with full control over randomness sources, largely inspired by [LibKGE](https://github.com/uma-pi1/kge).


## Key Features

- **Four Independent Randomness Sources**: Separate, independently seeded control over initialization, triple ordering, negative sampling, and dropout
- **Multiple KGE Models**: Support for TransE, DistMult, ConvE, RGCN, and Transformer architectures
- **Stability Analysis**: Computation of metrics to measure model stability across different random seeds
- **Hyperparameter Optimization**: Integration with Weights & Biases for sweep experiments
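
To illustrate what independent randomness sources mean in practice, here is a minimal sketch that uses one generator per source. It relies on Python's standard `random.Random` as a stand-in; the names `SOURCES`, `make_generators`, and `draw` are hypothetical and are not taken from the library, which may use different generators internally.

```python
import random

# Hypothetical names: one generator per randomness source listed above.
SOURCES = ("init", "order", "neg", "dropout")


def make_generators(seeds):
    """Build one independent generator per randomness source."""
    return {name: random.Random(seeds[name]) for name in SOURCES}


def draw(gens):
    """Draw from each stream (toy stand-ins for the real operations)."""
    return {
        # Embedding initialization
        "init": [gens["init"].gauss(0.0, 1.0) for _ in range(4)],
        # Order in which training triples are visited
        "order": gens["order"].sample(range(10), 10),
        # Entity ids used to corrupt triples
        "neg": [gens["neg"].randrange(100) for _ in range(8)],
        # Dropout mask
        "dropout": [gens["dropout"].random() < 0.5 for _ in range(8)],
    }


base = {"init": 42, "order": 456, "neg": 123, "dropout": 789}
run_a = draw(make_generators(base))
# Change only the negative-sampling seed; the other streams are untouched.
run_b = draw(make_generators({**base, "neg": 124}))

for name in SOURCES:
    print(name, "identical" if run_a[name] == run_b[name] else "different")
```

Because each source owns its generator, changing one seed cannot shift the numbers drawn by the others, which is what makes it possible to attribute a change in results to a single source.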


## Code Organization

```
├── data/                     # Dataset directory
├── main.py                   # Main entry point for training and experiments
├── kge/                      # Core KGE library
│   ├── models.py             # KGE model implementations (TransE, DistMult, ConvE, etc.)
│   ├── train.py              # Training loop and optimization
│   ├── data.py               # Data loading and preprocessing utilities
│   ├── eval.py               # Evaluation metrics and procedures
│   └── utils.py              # Utility functions and seed management
├── stability.py              # Stability experiment orchestration
├── training_utils.py         # Model initialization and training utilities
├── sweep_utils.py            # Hyperparameter sweep utilities
├── stability_measures/       # Stability analysis scripts and results
│   ├── stability_measures.py # Stability analysis script
│   ├── stability_measures_predictions.py  # For prediction metrics
│   └── stability_measures_space.py  # For space metrics
└── tests/...                  # Test suite
```

## Testing

Run the comprehensive test suite with pytest:

```bash
pytest tests/
```

### Test Categories

- **`test_seeds_MODELS.py`**: Verify that all randomness sources are reproducible and distinct for each model (TransE, DistMult, ConvE, Transformer)
- **`test_checkpointing.py`**: Ensure training can be resumed while maintaining reproducibility of random states
- **`test_reprod_train.py`** & **`test_reprod_sampler.py`**: Validate reproducibility of training procedures and negative samplers
- **`test_stability_space_equivalence.py`**: Confirm that the optimised (and ugly) space metrics are equivalent to the original space metrics
- **Additional tests**: Non-critical, but give a nice dopamine hit when green.
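
The property checked by the checkpointing test can be sketched as follows: a run that is interrupted and resumed must consume exactly the same random numbers as an uninterrupted one. This is only an illustration under simplifying assumptions; `train_steps` is a hypothetical stand-in for a training loop, and the standard `random.Random` replaces whatever generator states the library actually saves.

```python
import pickle
import random


def train_steps(rng, n_steps):
    """Stand-in for training: consume one random number per step."""
    return [rng.random() for _ in range(n_steps)]


# Uninterrupted reference run of 10 steps.
reference = train_steps(random.Random(0), 10)

# Interrupted run: 4 steps, then save the generator state with the step count.
rng = random.Random(0)
first_part = train_steps(rng, 4)
checkpoint = pickle.dumps({"step": 4, "rng_state": rng.getstate()})

# Resume in a fresh generator; its initial seed is irrelevant because
# setstate() overwrites the state completely.
restored = pickle.loads(checkpoint)
resumed_rng = random.Random()
resumed_rng.setstate(restored["rng_state"])
second_part = train_steps(resumed_rng, 10 - restored["step"])

assert first_part + second_part == reference
```

Saving only the seed would not be enough here: re-seeding on resume would replay the first four draws instead of continuing from the fifth.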


## Usage

### Prerequisites

Install dependencies:

```bash
pip install -r requirements.txt
```


### Example: training with a custom seed configuration

```bash
python3 main.py \
    --data_dir data/nations \
    --model DistMult \
    --seed_init 42 \
    --seed_neg 123 \
    --seed_order 456 \
    --seed_forward 789 \
    --use_gpu \
    --no-log_to_wandb
```


### Protocol from the paper to obtain results on the Nations dataset with DistMult

#### 1. Hyperparameter Tuning

Run hyperparameter optimization using Weights & Biases:

```bash
python3 main.py --sweep_id=$SWEEP_ID --data_dir data/nations --model DistMult --use_gpu --GPU_reproductibility
```

If you don't have a `sweep_id`, use the `sweep_luncher.sh` script to create one:

```bash
./sweep_luncher.sh
```

#### 2. Stability Training

Run multiple training sessions with different seeds to assess model stability:

```bash
python3 main.py --data_dir data/nations --model DistMult --use_gpu --GPU_reproductibility --stability_training --oar
```

**Options:**
- `--oar`: Launch parallel runs on an OAR cluster
- Without `--oar`: Run sequentially on the local machine
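
As an illustration of the idea behind such a campaign, the sketch below builds one `main.py` command per seed value while holding the other three seeds fixed. It reuses only the flags shown in the custom seed example above. The helper `commands_varying` is hypothetical, and varying one seed at a time is an assumption about the experimental design, not a description of what `--stability_training` does internally.

```python
import shlex

# Baseline seeds, taken from the custom seed example above.
BASE_SEEDS = {"seed_init": 42, "seed_neg": 123, "seed_order": 456, "seed_forward": 789}


def commands_varying(source, values, base=BASE_SEEDS):
    """Yield one main.py command per value, changing only the seed `source`."""
    for value in values:
        seeds = {**base, source: value}
        cmd = ["python3", "main.py", "--data_dir", "data/nations", "--model", "DistMult"]
        for flag, seed in seeds.items():
            cmd += [f"--{flag}", str(seed)]
        cmd.append("--no-log_to_wandb")
        yield cmd


# Three runs that differ only in the negative-sampling seed.
cmds = list(commands_varying("seed_neg", [1, 2, 3]))
for cmd in cmds:
    print(shlex.join(cmd))
```

The commands are only printed here; passing each list to `subprocess.run` would execute them sequentially.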

#### 3. Calculate Stability Metrics

Compute stability measures from multiple training runs:

```bash
python3 main.py --data_dir data/nations --model DistMult --stability_measures
```
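
To give a feel for what a prediction-level stability measure can look like, here is a minimal sketch that averages the Jaccard overlap of top-k predictions over all pairs of runs. This is one common choice, shown as an assumption; it is not necessarily among the measures implemented in `stability_measures/`, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(runs, k):
    """Average top-k Jaccard overlap over all run pairs and shared queries.

    `runs` is a list of dicts mapping a query, e.g. (head, relation),
    to the ranked list of predicted entities for that query.
    """
    scores = []
    for run_a, run_b in combinations(runs, 2):
        for query in run_a.keys() & run_b.keys():
            scores.append(jaccard(run_a[query][:k], run_b[query][:k]))
    if not scores:
        raise ValueError("need at least two runs with a shared query")
    return sum(scores) / len(scores)


# Toy data: two runs (different seeds) answering the same query.
runs = [
    {("e0", "r0"): ["e1", "e2", "e3"]},
    {("e0", "r0"): ["e1", "e3", "e4"]},
]
overlap = mean_pairwise_overlap(runs, k=3)
print(overlap)  # 2 shared entities out of 4 distinct ones -> 0.5
```

A value of 1.0 means every pair of runs returns the same top-k set for every query; lower values indicate that the seeds alone change which entities are predicted.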


