## Complex Numerical Reasoning with Numerical Semantic Pre-training Framework

This is the official codebase of the state-of-the-art **CNR-NST** framework for numerical complex query answering.


## Overview
![](figs/CNR-NST.png)


## Data Preparation
Download the KG data (FB15K, DB15K, YAGO15K) from [here](https://github.com/mniepert/mmkb) and place them under the `data/` folder. If you use the data, please also cite their paper.
Then go to the `kbc/` folder to prepare the KG data for the KGE model.

## Graph Construction
We provide the complete query type sampling code. First, construct the knowledge graph by running `./Generate_Queries/preprocessing/graph_construction.py`, which saves the generated graph in the `graphs` folder.


## Prepare Pre-Training Data
Next, move the data generated in the `./Generate_Queries/preprocessing` folder to the `./kbc/data` folder for training the Multi-ComplEx model.


## Combine Sample Data
Finally, run `../Generate_Queries/preprocessing/sample_multiprocessing.py` to perform query type sampling. The script runs the sampling in parallel, which significantly reduces the overall sampling time. (Queries containing the B operator need to be sampled separately.)


## Generate Queries
Run the `sample.py` program to merge multiple JSON data files into a single query `.pkl` file. Afterward, move the three `.pkl` files into the corresponding dataset folders within the `data` directory.
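
The merge step can be pictured with the minimal sketch below. It is not the code in `sample.py`: the function name `merge_json_to_pkl`, the `*.json` file pattern, and the assumption that each JSON file is a dictionary mapping a query type to a list of sampled queries are all hypothetical and only illustrate the idea.

```python
import glob
import json
import os
import pickle
from collections import defaultdict


def merge_json_to_pkl(json_dir, out_path, pattern="*.json"):
    """Merge every JSON file in json_dir into one pickle file.

    Assumption (hypothetical layout): each JSON file is a dict that maps a
    query-type name to a list of sampled queries.
    """
    merged = defaultdict(list)
    # Sort the paths so the merge order is deterministic.
    for path in sorted(glob.glob(os.path.join(json_dir, pattern))):
        with open(path, "r", encoding="utf-8") as f:
            part = json.load(f)
        for query_type, queries in part.items():
            merged[query_type].extend(queries)
    result = dict(merged)
    with open(out_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Queries of the same type from different files are concatenated, so the per-process output files can be produced independently and combined afterwards.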

## Wash Queries
We run the `wash_queries.py` program to further refine and optimize the query structure.

## Pretrain KGE
CNR-NST requires a pre-trained knowledge graph embedding (KGE) model to perform complex numerical query answering. We used the KGE implementation from ssl-relation-prediction.
If you wish to train the KGE (Multi-ComplEx) model on the three public datasets, run the following commands in the `kbc/` directory.

**FB15K**
```
CUDA_VISIBLE_DEVICES=0 nohup python -u src/main.py --dataset FB15K --score_rel True --model ComplEx --rank 500 --learning_rate 0.05 --batch_size 10000 --lmbda 0.05 --w_rel 4 --max_epochs 100 > output_complex_FB.log 2>&1 &
```
**DB15K**
```
CUDA_VISIBLE_DEVICES=0 nohup python -u src/main.py --dataset DB15K --score_rel True --model ComplEx --rank 500 --learning_rate 0.05 --batch_size 10000 --lmbda 0.05 --w_rel 4 --max_epochs 100 > output_complex_DB.log 2>&1 &
```
**YAGO15K**
```
CUDA_VISIBLE_DEVICES=0 nohup python -u src/main.py --dataset YAGO15K --score_rel True --model ComplEx --rank 500 --learning_rate 0.05 --batch_size 10000 --lmbda 0.05 --w_rel 4 --max_epochs 100 > output_complex_YAGO.log 2>&1 &
```
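
The three commands differ only in the `--dataset` value and the log file name. As a convenience, the sketch below rebuilds the command line for each dataset from the shared hyperparameters listed above. The helper name `kge_command` is hypothetical; the script only prints the commands and does not launch training.

```python
# Hyperparameters copied from the commands above; shared by all three datasets.
SHARED_ARGS = [
    "--score_rel", "True", "--model", "ComplEx", "--rank", "500",
    "--learning_rate", "0.05", "--batch_size", "10000",
    "--lmbda", "0.05", "--w_rel", "4", "--max_epochs", "100",
]


def kge_command(dataset):
    """Return the argument list for training the KGE model on one dataset."""
    return ["python", "-u", "src/main.py", "--dataset", dataset] + SHARED_ARGS


for dataset in ("FB15K", "DB15K", "YAGO15K"):
    # Print the command; run it from the kbc/ directory.
    print(" ".join(kge_command(dataset)))
```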

## Numerical Query Answering with CNR-NST
The commands we provide can reproduce the results of our complex numerical reasoning framework. Please note that in the final step, the `--kbc_path` parameter should be followed by the actual path of the pre-trained numerical semantic learning model. The `--fraction` parameter divides the neural adjacency matrix into segments so that each segment can be stored as a dense matrix on the GPU during computation. If GPU memory is insufficient, increasing `--fraction` can help reduce memory usage.
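
To see why a larger `--fraction` lowers peak memory, the sketch below computes the size of the largest dense segment for a given number of segments. It is an illustration only: the row-wise split, the float32 element size, and the helper names `segment_bounds` and `largest_segment_bytes` are assumptions, and the partitioning in `main.py` may differ.

```python
def segment_bounds(n_rows, fraction):
    """Split range(n_rows) into `fraction` contiguous row segments.

    Returns a list of (start, stop) pairs. Assumption: segments are taken
    along the row axis and the last segment absorbs any remainder.
    """
    if fraction < 1:
        raise ValueError("fraction must be a positive integer")
    step = n_rows // fraction
    bounds = []
    for i in range(fraction):
        start = i * step
        stop = n_rows if i == fraction - 1 else (i + 1) * step
        bounds.append((start, stop))
    return bounds


def largest_segment_bytes(n_rows, n_cols, fraction, bytes_per_value=4):
    """Size in bytes of the largest dense segment, assuming float32 values."""
    rows = max(stop - start for start, stop in segment_bounds(n_rows, fraction))
    return rows * n_cols * bytes_per_value
```

For an illustrative 15,000 x 15,000 float32 matrix, `largest_segment_bytes(15000, 15000, 1)` gives 900,000,000 bytes, while `largest_segment_bytes(15000, 15000, 10)` gives 90,000,000 bytes, so only about one tenth of the matrix has to be dense at any one time.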

The command will first utilize the pre-trained KGE model (stored in the `kbc/{dataset}/` directory) to compute the neural adjacency matrix and save it in the `neural_adj` folder.
The `use_newmetric` parameter determines whether new metrics will be used to evaluate the numeric answers.


**FB15K**
```      
CUDA_VISIBLE_DEVICES=0 nohup python -u main.py --data_path data/FB15k-number --kbc_path FB15K/best_valid.model --fraction 10 --neg_scale 6 > output_FB15K.log 2>&1 &
```
**DB15K**
```      
CUDA_VISIBLE_DEVICES=0 nohup python -u main.py --data_path data/DB15k-number --kbc_path DB15K/best_valid.model --fraction 10 --neg_scale 6 > output_DB15K.log 2>&1 &
```
**YAGO15K**
```      
CUDA_VISIBLE_DEVICES=0 nohup python -u main.py --data_path data/YAGO15k-number --kbc_path YAGO15K/best_valid.model --fraction 10 --neg_scale 50 > output_YAGO15K.log 2>&1 &
```


