# Breaking Rank Bottlenecks in Knowledge Graph Embeddings

TL;DR: To reproduce the results from the paper, run `scripts/prepare_experiments.sh` from the project root to download all the data and prepare all the experiment scripts.
Then run the experiment scripts under `scripts/experiments/`.
Before running anything, set your [wandb](https://docs.wandb.ai/quickstart) account and project in the main config file `config/config.yaml`, and make sure [uv](https://docs.astral.sh/uv/getting-started/installation/) is installed for dependency management.


## Data
### TSV-formatted KGs
Download a dataset such as **FB15K-237** or **WN18RR** from [here](https://github.com/villmow/datasets_knowledge_embedding).

Place it in a source directory:
```text
data/
|-- src/
|   |-- FB15K-237/
|       |-- train.tsv
|       |-- valid.tsv
|       |-- test.tsv
```

Then preprocess it with:
```bash
uv run scripts/preprocess.py --datasets FB15K-237 WN18RR --data-folder data/src --output-folder data/processed
```
The preprocessing script maps entities and relations to consecutive integers starting from 0.

The same can be done for other datasets with tab-separated files following the same format.
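
The integer mapping can be pictured with the minimal sketch below. It is an illustration only, not the code in `scripts/preprocess.py`: the function name `build_index`, the toy triples, and the choice to assign ids in order of first appearance are all assumptions.

```python
def build_index(triples):
    """Map entity and relation labels to consecutive integer ids starting at 0.

    Hypothetical helper: ids are assigned in order of first appearance.
    """
    entity_to_id, relation_to_id = {}, {}
    encoded = []
    for head, relation, tail in triples:
        # setdefault stores the next free id the first time a label is seen
        # and returns the existing id on every later occurrence.
        h = entity_to_id.setdefault(head, len(entity_to_id))
        r = relation_to_id.setdefault(relation, len(relation_to_id))
        t = entity_to_id.setdefault(tail, len(entity_to_id))
        encoded.append((h, r, t))
    return encoded, entity_to_id, relation_to_id


# Toy triples standing in for the rows of a train.tsv file.
toy_triples = [("a", "likes", "b"), ("b", "likes", "c"), ("c", "knows", "a")]
encoded, entity_to_id, relation_to_id = build_index(toy_triples)
```

In practice a single index would be shared across the train, validation, and test files so that the same label always receives the same id.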

### Hetionet and openbiolink
Download **Hetionet** and **openbiolink** and write the processed TSV files by running
```bash
uv run data/download_hetionet.py --output_dir data/processed/Hetionet --seed 42
uv run data/download_openbiolink.py --output_dir data/processed/openbiolink
```
This uses [PyKEEN](https://pykeen.readthedocs.io/en/stable/reference/datasets.html) to download the datasets and split them into train/validation/test sets. The original datasets do not come with standard train/validation/test splits, so we use PyKEEN and a fixed seed to split them for reproducibility.
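
To show why a fixed seed makes the split reproducible, here is a minimal sketch in plain Python. It does not use PyKEEN and is not the logic of the download scripts: the function `split_triples`, the 80/10/10 ratios, and the toy triples are assumptions made for illustration.

```python
import random


def split_triples(triples, seed, ratios=(0.8, 0.1, 0.1)):
    """Shuffle triples with a seeded generator and cut them into three parts.

    Hypothetical helper; the ratios are an assumption, not the paper's setting.
    """
    shuffled = list(triples)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_valid = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


toy_triples = [(i, 0, i + 1) for i in range(100)]
first = split_triples(toy_triples, seed=42)
second = split_triples(toy_triples, seed=42)  # same seed, same split
```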

### OGBL-biokg
The **ogbl** datasets do not need to be downloaded manually: run the training scripts and the data will be downloaded automatically.

## Running models

Run with
```bash
uv run scripts/main.py data_folder=data/processed dataset=FB15K-237 model=tail dimension=200 model/fusing_function=hadamard model.fusing_dropout=0.1 engine_config.learning_rate=1e-3 device=cuda:0
```
Choose the dataset from `FB15K-237`, `WN18RR`, `ogbl-biokg`, `Hetionet`, `openbiolink`, and `ogbl-wikikg2`.
* Note that `ogbl-wikikg2` is very large and you'll need a lot of resources to run it. We did not use it in the paper due to computational constraints.
* For `Hetionet` and `openbiolink`, we highly recommend setting `engine_config.valid_sample_size=10000` or a similarly small number. These datasets are large (unlike `FB15K-237` and `WN18RR`) and do not have an efficient validation/testing setup (unlike `ogbl-biokg` or `ogbl-wikikg2`). The whole test set is still used for evaluation at the end of training, which can take several hours.
* The evaluation code assumes that datasets include inverse relations (`add_inverse=True`), so head prediction is evaluated through tail prediction on inverse triples. For example, to evaluate "given (r,o), predict s" on the triple (s,r,o), the code evaluates "given (o,r_inv), predict s" on the inverse triple (o,r_inv,s).
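
The inverse-triple convention can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the function name `add_inverse_triples` and the id convention `r_inv = r + num_relations` are hypothetical.

```python
def add_inverse_triples(triples, num_relations):
    """Return the triples plus one inverse triple (o, r_inv, s) per (s, r, o).

    Assumed convention: the inverse of relation r gets id r + num_relations.
    """
    inverses = [(o, r + num_relations, s) for s, r, o in triples]
    return list(triples) + inverses


# Two toy triples over two relations (ids 0 and 1).
toy_triples = [(0, 0, 1), (1, 1, 2)]
augmented = add_inverse_triples(toy_triples, num_relations=2)
# Head prediction on (0, 0, 1) becomes tail prediction on (1, 2, 0):
# the query is (o=1, r_inv=2) and the expected answer is s=0.
```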

Set up the config as you like by modifying `config/config.yaml` (type of model, embedding dimension, learning rate, etc.) as well as the sub-config files under `config/model` (settings for the specific model).

Under `config/config.yaml`, you can specify the family of model to use by setting the `model` field:
```yaml
defaults:
  - model: mixture # pipeline, mixture, pykeen
...
```
The model types are:
- `tail_model`: regular KGE models including DistMult, ComplEx, ConvE, etc., which do tail prediction using matrix multiplications. You can choose which one to use by editing the `config/model/tail_model.yaml` file.
- `tail_mixture`: high-rank variations of the KGE models expanded with a mixture layer. You can choose which one to use by editing the `config/model/mixture.yaml` file.
- `pykeen`: some default implementations of DistMult, ComplEx, ConvE from PyKEEN for reference. You can choose which one to use by editing the `config/model/pykeen.yaml` file. In the end, we did not use these models for our experiments because they compute object scores through repetitions and element-wise products, which is much less efficient than the matrix multiplications used by the other implementations.
- `gnn_kge` and `gnn_mixture`: used for `CompGCN` with and without mixture layer, respectively.

In `tail_model.yaml` and `tail_mixture.yaml`, you can specify the embedding type, fusing function, and grammatical encoder to use.
Separating the score function into a "grammatical encoder" and "fusing function" is legacy, and we will soon simplify this by having only a "score function" object.
Here are some common configurations:
* DistMult:
  * embedding: real
  * fusing_function: hadamard # element-wise product
  * grammatical_encoder: identity
* ComplEx:
  * embedding: complex
  * fusing_function: complex
  * grammatical_encoder: identity # used to be `complex`, but is now legacy code
* ConvE:
  * embedding: real
  * fusing_function: conve
  * grammatical_encoder: identity

`tail_model` models then project to the objects using a linear dot product, whereas `mixture` models use a more elaborate function using a mixture layer.
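
As a rough picture of the DistMult configuration above, the sketch below fuses subject and relation with an element-wise product and then scores every candidate object with a dot product. It uses plain Python lists for readability and is not the repository's implementation: the function name `distmult_tail_scores` and the toy embeddings are assumptions.

```python
def distmult_tail_scores(subject, relation, entities):
    """Score every candidate tail entity for the query (subject, relation).

    Hypothetical helper illustrating the two steps described above.
    """
    # Fusing function (hadamard): element-wise product of subject and relation.
    fused = [s * r for s, r in zip(subject, relation)]
    # Projection: dot product of the fused query with each entity embedding,
    # i.e. one row of a matrix multiplication against the entity matrix.
    return [sum(f * e for f, e in zip(fused, entity)) for entity in entities]


# Toy 2-dimensional embeddings for three entities and one relation.
entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relation = [2.0, 3.0]
scores = distmult_tail_scores(entities[0], relation, entities)
```

With batched tensors the same projection becomes a single matrix multiplication between the fused queries and the transposed entity matrix.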

## Aggregating results

`results/` contains undocumented code which can be useful to aggregate results from different runs on wandb.

## Running the HPO with WandB sweeps
### Run sweep

Create a sweep from a yaml file.
```sh
wandb sweep config/sweep.yaml
```

Retrieve the sweep id from the command output and run sweep agents on different GPUs. For example, with 4 GPUs:
```sh
nohup env CUDA_VISIBLE_DEVICES=0 wandb agent [sweep-id] > agent0.log &
nohup env CUDA_VISIBLE_DEVICES=1 wandb agent [sweep-id] > agent1.log &
nohup env CUDA_VISIBLE_DEVICES=2 wandb agent [sweep-id] > agent2.log &
nohup env CUDA_VISIBLE_DEVICES=3 wandb agent [sweep-id] > agent3.log &
```

### Cancel sweep

Cancel a sweep.
```sh
wandb sweep [sweep-id] --cancel
```
