Format of Input Data
--------------------

DGL-KE toolkits provide commands for training, evaluation and inference. Different commands require different kinds of input data, including:

  * **Knowledge Graph** The knowledge graph used in training, evaluation and inference.
  * **Trained Embeddings** The embeddings generated by ``dglke_train`` or ``dglke_dist_train``.
  * **Other Data** Extra input data that is used by the inference tools.

A knowledge graph is usually stored in the form of triplets (head, relation, tail). Heads and tails are entities in the knowledge graph. Entities and relations can all be identified with unique IDs. In general, there are two types of IDs for entities and relations in DGL-KE:

  * **Raw ID** The entities and relations can be identified by names, usually in the format of strings.
  * **KGE ID** The IDs used during knowledge graph training, evaluation and inference. Both entities and relations are identified by integers, which should start from 0 and be contiguous.

If the input file of a knowledge graph uses Raw IDs for entities and relations and does not provide a mapping between Raw IDs and KGE IDs, DGL-KE will generate an ID mapping automatically.
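
As a rough illustration of what such a mapping looks like, the sketch below assigns contiguous integer IDs starting from 0 in order of first appearance. The helper ``build_id_mapping`` is hypothetical and is not DGL-KE's own implementation; the order in which DGL-KE assigns IDs may differ.

.. code-block:: python

    def build_id_mapping(names):
        """Assign contiguous integer IDs, starting from 0, in order of first appearance."""
        mapping = {}
        for name in names:
            if name not in mapping:
                mapping[name] = len(mapping)
        return mapping

    triplets = [
        ("Beijing", "is_capital_of", "China"),
        ("London", "is_capital_of", "UK"),
        ("UK", "located_at", "Europe"),
    ]

    # Entities come from both the head and tail columns; relations from the middle column.
    entity2id = build_id_mapping(e for h, _, t in triplets for e in (h, t))
    relation2id = build_id_mapping(r for _, r, _ in triplets)

    # Rewrite every triplet with KGE IDs instead of Raw IDs.
    kge_triplets = [(entity2id[h], relation2id[r], entity2id[t]) for h, r, t in triplets]
    print(kge_triplets)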

The following table gives an overview of the input data required by each toolkit command (Y: required, N: not used).

+------------------+----------+------------+---------------------+----------------+
|  DGL-KE Toolkit  | Knowledge Graph       | Trained Embeddings  |   Other Data   |
+==================+==========+============+=====================+================+
|                  | Triplets | ID Mapping |     Embeddings      |                |
+------------------+----------+------------+---------------------+----------------+
| dglke_train      |    Y     |     Y      |         N           |       N        |
+------------------+----------+------------+---------------------+----------------+
| dglke_eval       |    Y     |     N      |         Y           |       N        |
+------------------+----------+------------+---------------------+----------------+
| dglke_partition  |    Y     |     Y      |         N           |       N        |
+------------------+----------+------------+---------------------+----------------+
| dglke_dist_train |           Use data generated by dglke_partition              |
+------------------+----------+------------+---------------------+----------------+
| dglke_predict    |    N     |     Y      |         Y           |       Y        |
+------------------+----------+------------+---------------------+----------------+
| dglke_emb_sim    |    N     |     Y      |         Y           |       Y        |
+------------------+----------+------------+---------------------+----------------+


Format of Knowledge Graph Used by DGL-KE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DGL-KE supports three kinds of knowledge graph input:

  * **Built-in Knowledge Graph** Built-in knowledge graphs are preprocessed datasets provided by the DGL-KE package. There are five built-in datasets: FB15k, FB15k-237, wn18, wn18rr and Freebase.
  * **Raw User Defined Knowledge Graph** A raw user defined knowledge graph dataset uses Raw IDs, so the IDs must be converted before a KGE model can be trained on the dataset. ``dglke_train``, ``dglke_eval`` and ``dglke_partition`` provide basic support for doing this ID conversion automatically.
  * **KGE User Defined Knowledge Graph** A KGE user defined knowledge graph dataset already uses KGE IDs. The entities and relations in triplets are integers.

Format of Built-in Knowledge Graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DGL-KE provides five built-in knowledge graphs:

+------------+------------+-------------+------------+
| Dataset    | #nodes     | #edges      | #relations |
+============+============+=============+============+
| FB15k      | 14,951     | 592,213     | 1,345      |
+------------+------------+-------------+------------+
| FB15k-237  | 14,541     | 310,116     | 237        |
+------------+------------+-------------+------------+
| wn18       | 40,943     | 151,442     | 18         |
+------------+------------+-------------+------------+
| wn18rr     | 40,943     | 93,003      | 11         |
+------------+------------+-------------+------------+
| Freebase   | 86,054,151 | 338,586,276 | 14,824     |
+------------+------------+-------------+------------+

Each of these built-in datasets contains five files:

 * train.txt: training set, each line contains a triplet [h, r, t]
 * valid.txt: validation set, each line contains a triplet [h, r, t]
 * test.txt: test set, each line contains a triplet [h, r, t]
 * entities.dict: ID mapping of entities
 * relations.dict: ID mapping of relations

Format of Raw User Defined Knowledge Graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A raw user defined knowledge graph dataset uses Raw IDs. The knowledge graph can be stored in a single file (providing only the training set) or in three files (training, validation and test sets). Each file stores triplets of the knowledge graph. The order of head, relation and tail can be arbitrary, e.g. [h, r, t]. A delimiter should be used to separate them. Recommended delimiters include ``\t``, ``|``, ``,`` and ``;``. The ID mapping is generated automatically by the DGL-KE toolkits for raw user defined knowledge graphs.

One extra column can be provided after ``hrt`` to store an edge importance score. The importance score can be used in training to highlight the importance of positive edges. The scores should be floats.
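
To make the layout concrete, here is a minimal sketch of a reader for such a file. It assumes comma-delimited input with double-quoted names and an optional fourth score column. The function ``read_raw_triplets`` and its ``fmt`` parameter are hypothetical illustrations and are not part of DGL-KE.

.. code-block:: python

    import csv
    import io

    def read_raw_triplets(handle, delimiter=",", fmt="hrt"):
        """Yield (head, relation, tail, score) tuples; score is None when absent."""
        # Map each letter of the column order, e.g. "hrt", to its column index.
        order = {name: idx for idx, name in enumerate(fmt)}
        for row in csv.reader(handle, delimiter=delimiter):
            if not row:
                continue  # skip blank lines
            head, rel, tail = row[order["h"]], row[order["r"]], row[order["t"]]
            # The optional importance score is assumed to be the fourth column.
            score = float(row[3]) if len(row) > 3 else None
            yield head, rel, tail, score

    sample = io.StringIO(
        '"Beijing","is_capital_of","China",2.0\n'
        '"London","is_capital_of","UK"\n'
    )
    rows = list(read_raw_triplets(sample))
    print(rows)

The ``csv`` module removes the surrounding double quotes while parsing, so the names come back as plain strings.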

The following is an example of Raw User Defined Knowledge Graph files:

train.txt::

    "Beijing","is_capital_of","China"
    "Paris","is_capital_of","France"
    "London","is_capital_of","UK"
    "UK","located_at","Europe"
    "China","located_at","Asia"
    "Tokyo","is_capital_of","Japan"


valid.txt::

    "France","located_at","Europe"


test.txt::

    "Japan","located_at","Asia"

The following is an example of a Raw User Defined Knowledge Graph with edge importance scores:

train.txt::

    "Beijing","is_capital_of","China",2.0
    "Paris","is_capital_of","France",1.0


Format of User Defined Knowledge Graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A KGE user defined knowledge graph uses KGE IDs, which means both the entities and relations have already been remapped. The entity IDs and relation IDs must both **start from 0 and be contiguous**. The knowledge graph can be stored in a single file (providing only the training set) or in three files (training, validation and test sets), along with two ID mapping files (one for entity IDs and one for relation IDs). The knowledge graph is stored as triplets in these files. The order of head, relation and tail can be arbitrary, e.g. [h, r, t]. A delimiter should be used to separate them. Recommended delimiters include ``\t``, ``|``, ``,`` and ``;``. The ID mapping information is stored as pairs in the mapping files, with pair[0] as the integer ID and pair[1] as the original raw ID. ``dglke_train`` and ``dglke_dist_train`` perform some integrity checks on the IDs against the mapping files.

One extra column can be provided after ``hrt`` to store an edge importance score. The importance score can be used in training to highlight the importance of positive edges. The scores should be floats.

The following is an example of User Defined Knowledge Graph files:

train.txt::

    0,0,1
    2,0,3
    4,0,5
    5,1,6
    1,1,7
    8,0,9

valid.txt::

    3,1,6

test.txt::

    9,1,7

The following is an example of a User Defined Knowledge Graph with edge importance scores:

train.txt::

    0,0,1,2.0
    2,0,3,1.0

The following is an example of an entity ID mapping file:

entities.dict::

    0,"Beijing"
    1,"China"
    2,"Paris"
    3,"France"
    4,"London"
    5,"UK"
    6,"Europe"
    7,"Asia"
    8,"Tokyo"
    9,"Japan"

The following is an example of a relation ID mapping file:

relations.dict::

    0,"is_capital_of"
    1,"located_at"
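
Before training, it can be useful to verify the two requirements above yourself. The sketch below reads a mapping file of ``id,raw_id`` pairs, checks that the IDs are exactly 0 to n-1, and looks for relation IDs in the triplets that are missing from the mapping. Both helper functions are hypothetical; the integrity checks that DGL-KE actually runs are not described here and may differ.

.. code-block:: python

    import csv
    import io

    def read_id_mapping(handle, delimiter=","):
        """Return {integer_id: raw_id} from lines of the form id<delimiter>raw_id."""
        mapping = {}
        for row in csv.reader(handle, delimiter=delimiter):
            if row:  # skip blank lines
                mapping[int(row[0])] = row[1]
        return mapping

    def is_contiguous_from_zero(ids):
        """True when the IDs are exactly 0, 1, ..., n-1."""
        ids = set(ids)
        return ids == set(range(len(ids)))

    relations = read_id_mapping(io.StringIO('0,"is_capital_of"\n1,"located_at"\n'))
    train = [(0, 0, 1), (2, 0, 3), (5, 1, 6)]

    ok = is_contiguous_from_zero(relations)
    # Relation IDs that occur in the triplets but not in the mapping file.
    unknown = {r for _, r, _ in train if r not in relations}
    print(ok, unknown)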

Format of Trained Embeddings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The trained embeddings are generated by the ``dglke_train`` or ``dglke_dist_train`` command and are stored in npy format. Usually there are two files:

  * **Entity embeddings** Entity embeddings are stored in a file named ``<dataset_name>_<model>_entity.npy`` and can be loaded with ``numpy.load()``.
  * **Relation embeddings** Relation embeddings are stored in a file named ``<dataset_name>_<model>_relation.npy`` and can be loaded with ``numpy.load()``.
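
The following sketch shows the loading step. Because no real training output is available here, it first writes two placeholder arrays using the naming pattern above and then reads them back. The dataset name ``demo``, the model name ``TransE`` and the array shapes are assumptions made for illustration only; real shapes depend on the dataset, the model and the training options.

.. code-block:: python

    import os
    import tempfile

    import numpy as np

    # Hypothetical values: 10 entities, 2 relations, embedding dimension 4.
    dataset, model, dim = "demo", "TransE", 4

    with tempfile.TemporaryDirectory() as tmp:
        entity_path = os.path.join(tmp, f"{dataset}_{model}_entity.npy")
        relation_path = os.path.join(tmp, f"{dataset}_{model}_relation.npy")

        # Stand-ins for the files that a training run would write.
        np.save(entity_path, np.zeros((10, dim), dtype=np.float32))
        np.save(relation_path, np.zeros((2, dim), dtype=np.float32))

        # Loading the embeddings is a plain numpy.load() call per file.
        entity_emb = np.load(entity_path)
        relation_emb = np.load(relation_path)

    print(entity_emb.shape, relation_emb.shape)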

Format of Input Data Used by DGL-KE Inference Tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both ``dglke_predict`` and ``dglke_emb_sim`` require a user-provided list of objects to run inference on.

Format of Raw Input Data
^^^^^^^^^^^^^^^^^^^^^^^^^

Raw Input Data uses Raw IDs. The input file therefore contains objects as Raw IDs, and the necessary ID mapping file(s) are required. Each line of the input file contains exactly one object, and a file can contain multiple lines. An ID mapping file stores the mapping information as pairs, with pair[0] as the integer ID and pair[1] as the original raw ID.
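
The relationship between a raw input list and its mapping file can be sketched as follows: the mapping file is inverted into a raw-ID-to-integer lookup, and each object in the list is translated through it. The helpers ``read_raw_to_id`` and ``read_object_list`` are hypothetical and only illustrate the file layout; they are not DGL-KE functions.

.. code-block:: python

    import csv
    import io

    def read_raw_to_id(handle, delimiter=","):
        """Invert a mapping file (id,raw_id per line) into {raw_id: integer_id}."""
        return {row[1]: int(row[0]) for row in csv.reader(handle, delimiter=delimiter) if row}

    def read_object_list(handle):
        """Read one object per line, dropping surrounding whitespace and double quotes."""
        return [line.strip().strip('"') for line in handle if line.strip()]

    entities = read_raw_to_id(io.StringIO('0,"Beijing"\n1,"China"\n4,"London"\n'))
    heads = read_object_list(io.StringIO('"Beijing"\n"London"\n'))

    # Translate each raw name into its integer KGE ID.
    head_ids = [entities[name] for name in heads]
    print(head_ids)  # [0, 4]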

The following is an example of raw input files for ``dglke_predict``:

head.list::

    "Beijing"
    "London"

rel.list::

    "is_capital_of"

tail.list::

    "China"
    "France"
    "UK"

entities.dict::

    0,"Beijing"
    1,"China"
    2,"Paris"
    3,"France"
    4,"London"
    5,"UK"
    6,"Europe"

relations.dict::

    0,"is_capital_of"
    1,"located_at"

Format of KGE Input Data
^^^^^^^^^^^^^^^^^^^^^^^^

KGE Input Data uses KGE IDs. The input file therefore contains objects as KGE IDs, i.e., integers. Each line of the input file contains exactly one object, and a file can contain multiple lines.

The following is an example of KGE input files for ``dglke_predict``:

head.list::

    0
    4

rel.list::

    0

tail.list::

    1
    3
    5
