Supplemental material for the paper "Bridging ML and algorithms: comparison of hyperbolic embeddings"

After the paper is accepted, full data will be available on a repository such as FigShare.

Projects included:
==================

hyperbolic-embedder: from https://bitbucket.org/HaiZhung/hyperbolic-embedder/overview (commit 3ade6d7d67188b0ab82949397ea5da62e4d9c845, 2018-05-02)
poincare-embeddings: from https://github.com/facebookresearch/poincare-embeddings (commit ff1d846db3a64a759e56173d7846c164a37654f9, 2021-09-16)
hyperrogue/DHRG: from https://github.com/zenorogue/hyperrogue (commit 5a33967711b017c1453d108ffeeb18d1cf912c6d, 2023-04-01)
mercator: from https://github.com/networkgeometry/mercator (commit a5dd4a05f4d77f92c32ee7750efd450cee0d3014, 2022-06-21)
TreeRep: from https://github.com/rsonthal/TreeRep (commit 8ed4d830b5d0da41aeecf786d5be650ed75b8d59, 2023-06-22)
HyperbolicTiling_Learning: from https://github.com/ydtydr/HyperbolicTiling_Learning (commit c77f0d1a1b32ed5437a59d7cdeb8426ff03ea70b, 2020-03-19)
hypviewer: from https://graphics.stanford.edu/~munzner/h3/download.html (not git -- last modified in 2003)
d-mercator: from https://github.com/networkgeometry/d-mercator (commit 45e71880b8744ba4e9e87c3db5e6675551cbf200, 2023-10-17)
kvk: from https://bitbucket.org/dk-lab/2020_code_hyperlink/src/master/ (commit 31b65ce7545e226f62dcfeddb04043aeb149866d, 2021-06-11)
coalescent_embedding: from https://github.com/biomedical-cybernetics/coalescent_embedding (commit 842a1dae06f8bb4a7c26a7d64f25285f55d4c1e6, 2019-08-07)
hypCLOVE: from https://github.com/samu32ELTE/hypCLOVE (commit dbab9523af758500ef2e76c1aed6b1a0db4444fb, 2025-10-16)
LPCS: from https://www.sciencedirect.com/science/article/pii/S0378437116000182

See `diffs` for the changes from the original commits listed above; the changes are summarized below. (Some files included in the original repos, such as datasets and helper tools, have not been removed.)

In hyperbolic-embedder:
- fix a compilation error on newer C++

In poincare-embeddings:
- add a CLI option `-initial` (not actually used in the final paper)
- create kx-evaluate.py, mostly to evaluate embeddings using MAP
  (including BFKL embedding, but as mentioned in the paper, it did not work due to numerical precision errors) and export embeddings to a format recognized by DHRG
- create wordnets/transitive_closure_verb.py to export the verb hierarchy by analogy to wordnets/transitive_closure.py
- add an option to change the seed via the `SEED` environment variable (not discussed in the paper)
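
As an illustration of the `SEED` option, here is a minimal Python sketch of reading a seed from the environment. The function names and the default value are hypothetical, and only Python's own `random` module is seeded here; the actual change in poincare-embeddings is in `diffs` and may work differently.

```python
import os
import random


def seed_from_env(default=0):
    """Return the seed given in the SEED environment variable.

    The variable name comes from the text; the default is an assumption.
    """
    return int(os.environ.get("SEED", default))


def seed_everything(default=0):
    """Seed Python's random module and return the seed that was used."""
    seed = seed_from_env(default)
    random.seed(seed)
    return seed
```

With this sketch, a run started with `SEED=42` in the environment uses the seed 42, and a run without `SEED` falls back to the default.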

In hyperrogue/dhrg:
- some irrelevant files (e.g., music) were removed
- create maprank.cpp, which computes mAP and MeanRank
- create compare.cpp for various analyses (load distance tables, Poincare 2D and 3D embeddings)
- various minor changes to access necessary tools (access compute-map.cpp and dhrg/routing via commandline, access landscape from dhrg, etc.)
- code to simplify the visualization output, and some fixes to visualization
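
For readers unfamiliar with the two quality measures, the following Python sketch shows one common way to compute mAP and MeanRank from a distance table and an adjacency list. It is not a translation of maprank.cpp: the function name, the input layout, and the handling of ties (broken by vertex index) are assumptions made for illustration.

```python
def map_and_meanrank(dist, adj):
    """Compute (mAP, MeanRank) for an embedding.

    dist: n x n list of lists with embedded distances.
    adj:  list of sets, adj[u] = neighbours of u (u itself excluded).
    Assumes the graph has at least one edge.
    """
    n = len(dist)
    ap_sum, ap_count = 0.0, 0
    rank_sum, rank_count = 0.0, 0
    for u in range(n):
        nbrs = adj[u]
        if not nbrs:
            continue  # vertices without neighbours do not contribute
        # All other vertices, nearest first (ties broken by index).
        order = sorted((v for v in range(n) if v != u),
                       key=lambda v: dist[u][v])
        hits = 0            # neighbours seen so far
        non_nbrs_seen = 0   # non-neighbours seen so far
        precisions = []
        for pos, v in enumerate(order, start=1):
            if v in nbrs:
                hits += 1
                precisions.append(hits / pos)
                # Rank of v among the non-neighbours of u.
                rank_sum += non_nbrs_seen + 1
                rank_count += 1
            else:
                non_nbrs_seen += 1
        ap_sum += sum(precisions) / len(precisions)
        ap_count += 1
    return ap_sum / ap_count, rank_sum / rank_count
```

A perfect embedding, in which every neighbour is closer than every non-neighbour, gives mAP = 1 and MeanRank = 1.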

In HyperbolicTiling_Learning:
- remove the references to HalfspaceManifold, which is referred to but does not seem to be present in the repo; also, train-grqc.py refers to `group_rie`, which is not available
- add an option to produce a table of distances that can be analyzed using other tools (specifically, we use dhrg)
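
To illustrate what a table of distances contains, the sketch below builds one from coordinates in the Poincare ball model. The choice of model and the function names are assumptions for illustration only; the option added to HyperbolicTiling_Learning may use a different model and writes its own file format, which is not reproduced here.

```python
import math


def poincare_distance(x, y):
    """Hyperbolic distance between two points of the open unit ball."""
    sq_norm_x = sum(c * c for c in x)
    sq_norm_y = sum(c * c for c in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    # Guard against rounding slightly below 1, where acosh is undefined.
    return math.acosh(max(1.0, arg))


def distance_table(points):
    """Return the symmetric n x n table of pairwise distances."""
    n = len(points)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = poincare_distance(points[i], points[j])
            table[i][j] = table[j][i] = d
    return table
```

A table like this is quadratic in the number of vertices, which is consistent with the large disttables sizes listed under "Files omitted".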

In TreeRep:
- implemented `experiment.jl`, which runs the experiments and outputs the results

In kvk:
- the embedder expects the nodes to be indexed with numbers -- adapted the embedder to work with more general node names
- implemented the `bridging-hyperlink.sh` script, which runs the experiments and outputs the results
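
The adaptation to general node names amounts to mapping each name to a consecutive integer and keeping the reverse mapping. A minimal Python sketch of that idea follows; the function name and the data layout are hypothetical, and the actual change to the kvk embedder is in `diffs`.

```python
def index_edges(edges):
    """Map arbitrary node names to consecutive integers.

    edges: iterable of (name_a, name_b) pairs.
    Returns (indexed_edges, names), where names[i] is the original
    name of node i, in order of first appearance.
    """
    ids = {}
    indexed = []
    for a, b in edges:
        # setdefault assigns the next free index on first sight of a name.
        ia = ids.setdefault(a, len(ids))
        ib = ids.setdefault(b, len(ids))
        indexed.append((ia, ib))
    names = [None] * len(ids)
    for name, i in ids.items():
        names[i] = name
    return indexed, names
```

The `names` list allows the embedder's numeric output to be translated back to the original node names.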

In coalescent_embedding:
- we have been running this embedder using GNU Octave instead of Matlab. However, Octave does not include the `graphallshortestpaths` function,
  so we have written a C++ program scripts/graph-to-matlab.cpp which takes a graph as input and produces the file `coalescent_embedding/usage_example/graphallshortestpaths.m`
  containing the shortest path matrix for the given graph.
- Octave with the packages statistics and symbolic should be installed (in Octave, run `pkg install -forge symbolic statistics`)
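
The shortest path matrix mentioned above can be computed by running a breadth-first search from every vertex. The Python sketch below shows this for an unweighted, undirected graph with vertices numbered from 0; these are assumptions for illustration, and the sketch does not reproduce scripts/graph-to-matlab.cpp or the `.m` file it writes.

```python
from collections import deque


def all_shortest_paths(n, edges):
    """Return the n x n matrix of hop distances.

    n: number of vertices (numbered 0..n-1).
    edges: iterable of (a, b) pairs, treated as undirected.
    Unreachable pairs get float('inf').
    """
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for s in range(n):
        # Standard BFS from s; each vertex is enqueued at most once.
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] == inf:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist
```

This takes O(n * (n + m)) time for n vertices and m edges, which is usually fine for graphs of the sizes used here but produces a dense n x n result.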

In mercator/d-mercator:
- the included version of Eigen does not compile with newer g++ because of a template function that does not seem to be used, so we have commented it out

In LPCS:
- added `louvain-community.R` which performs the first step, as explained in readme.txt
- some other changes
- fixed some warnings reported by Octave
- made it work when there are only two communities

Since the official implementation ran very slowly in Octave, we have also reimplemented the algorithm in C++ (script/lpcs-remake.cpp).
Our reimplementation fixes a bug in ConnectNextCom.m (which compares the intimacies of x(1) and x(2), while, according to the paper,
intimacies of the first and last subcommunity in x should be compared here).

In hypCLOVE:
- added `import-clove.py`, which reads the graphs and calls hypCLOVE proper
- changed hypCLOVE to work even if estimated b is greater than 1 (by assuming 1 instead)

Datasets included: (graphs/*/graph-orig.txt)
astroph, condmat, grqc, hepph, facebook: from http://snap.stanford.edu/data/
brain maps data: https://github.com/networkgeometry/navigable_brain_maps_data

Our setup and compute
=====================

Hardware:
[1] Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, NVIDIA GeForce GTX 1060 6GB/PCIe/SSE2
[2] 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz, OpenGL renderer string: NVIDIA RTX A3000 Laptop GPU/PCIe/SSE2

Software: Arch Linux, g++ (older experiments: 12.2.1, newer experiments: 15.2.1)

The times reported in the paper have been obtained on [1]. Some experiments have been run on [2].

The total time of all computations (as reported in log.txt) is about 2700 hours. There were some computations that were aborted (due to bugs) but they were a minority.

Files omitted
=============

Due to the file size limits for the supplemental material, we had to omit some of the files. These include:

- the results of TreeRep (disttables 1856 MB + treerep 7593 MB)
- embeddings generated by Poincare/Lorentz (2340 MB)
- Euclidean embeddings (50D: 69 MB, 200D: 216 MB)
- part of the data regarding repeated experiments (168 MB)
- hypviewer embeddings (27 MB), rogueviz coordinates (5 MB)
- links files (721 MB)
- most files regarding simulated networks, except time logs and summaries (751 MB)
- experiments on other kinds of simulated networks which are not included in the paper (277 MB)

< note: sizes not updated for the current revision >

How to reproduce:
=================

Note: scripts are designed to be called from the main directory. (e.g. `bash scripts/compile-all.sh` not `cd scripts; bash compile-all.sh`)

- create and activate the poincare environment, as described in poincare-embeddings/README.org
- compile all included projects (`bash scripts/compile-all.sh`)
- convert graphs/*/graph-orig.txt to the correct formats graphs/*/graph.txt and graphs/*/graph.csv (`bash scripts/read-networks.sh`)
- create WordNet hierarchies and convert them to the correct formats (`bash scripts/hierarchies.sh`)
- perform the experiments (for example: `bash scripts/process.sh -bfkl-embed-bfkl-eval- name`, replacing `name` with every graph in graphs; give the steps from process.sh as required)
- build the tables `tables/*.tex` using `bash scripts/generate-rw-table.sh` and `bash scripts/generate-rep-table.sh`
- generate simulated networks: `bash scripts/sim-generate.sh`
- run the experiments on simulated networks: `STEPS=[...] bash scripts/simulate.sh` (replace [...] with the desired steps)
- build the CSV data `tables/statistical-data.csv` using `bash scripts/generate-for-table.sh`
- compute the precise BFKL time data `tables/precise-times.csv` using `bash scripts/compute-precise-simulated-times.sh bp`
- scripts/analysis.R was used to create the graphs and statistical analysis
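
The "perform the experiments" step above has to be repeated for every graph in graphs. The following Python sketch shows one way to generate those calls from the main directory. The helper names are hypothetical, the step string is only the example quoted above, and the sketch defaults to a dry run that prints the commands without executing them.

```python
import subprocess
from pathlib import Path


def process_commands(graphs_dir="graphs", steps="-bfkl-embed-bfkl-eval-"):
    """Build one process.sh call per subdirectory of graphs_dir.

    The steps string is the example from the text; replace it with
    the steps from process.sh that are actually required.
    """
    names = sorted(p.name for p in Path(graphs_dir).iterdir() if p.is_dir())
    return [["bash", "scripts/process.sh", steps, name] for name in names]


def run_all(commands, dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing experiment.
            subprocess.run(cmd, check=True)
```

Calling `run_all(process_commands())` lists the commands; passing `dry_run=False` runs them one after another, which may take a long time given the total compute reported above.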

Explanation of files:
=====================

In graphs/[graph name]:
- `graph-orig.txt`: original data from the source
- `graph.txt`: data in the BFKL format
- `graph.csv`: data in the poincare-embeddings format
- `notrans.txt`: for hierarchies, edges without transitive closure (created by `scripts/create-notrans.sh`, used by the visualizer `scripts/visualize.sh`)

In results/[graph name]:
- `log.txt` logs all times
- `log-*.txt` contain the output of various steps
- `*.bin` and `*.bin.best` are embeddings obtained from poincare-embeddings
- `*-coordinates.txt` are various embeddings; BFKL also produces `bfkl-links.txt` which is used by other dhrg evaluations for convenience
- `*-dhrg.txt` are these embeddings improved by BFKL

Extra explanation:
==================

This code & data appendix includes some methods and graphs which are not explained in the main paper. For completeness, we explain them here.

The 'landscape' embedding method is a method of transforming hyperbolic embeddings into high-dimensional Euclidean embeddings. It is based on the 'landscape' method from [1].
We devised this method to improve on the very low results for Euclidean embeddings reported by [Nickel and Kiela 2017]. While our method did indeed get better
results, it turned out that the Euclidean embedder from [Nickel and Kiela 2017] actually gets much better results than reported (and also better than the landscape results),
thus reducing the significance of our result.

The 'rogueviz' embedding method is the basic hierarchy embedder implemented in the RogueViz library.

The 'sim3' graphs are three-dimensional artificially constructed networks. While MAP/MeanRank/greedy quality measures will always favor higher-dimensional embeddings,
we expect that ICV will favor the actual dimension, e.g., 2 for two-dimensional artificial networks and 3 for three-dimensional artificial networks. However, this is
beyond the scope of the paper under review.

[1] Non-Euclidean Self-Organizing Maps. Dorota Celińska-Kopczyńska, Eryk Kopczyński. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22).
https://www.ijcai.org/proceedings/2022/0269.pdf

