# Metric Forest Completion

This supplement contains the source code used to generate the results and plots in the following paper:
<br>

**Better Learning-Augmented Spanning Tree Algorithms via Metric Forest Completion** [[openreview](https://openreview.net/forum?id=TWmS4o41oA)] 

Nate Veldt, Thomas Stanley, Benjamin W. Priest, Trevor Steil, Keita Iwabuchi, T.S. Jayram, Grace J. Li, Geoffrey Sanders


For completeness and comparison, this supplement also contains code to reproduce experiments from an earlier paper, which the ICLR 2026 paper builds upon:

<br>

**Approximate Forest Completion and Learning-Augmented Algorithms for Metric Minimum Spanning Trees** [[icml](https://icml.cc/virtual/2025/poster/43905)] [[openreview](https://openreview.net/forum?id=r9Yp0v8toA)] [[arxiv](https://arxiv.org/abs/2502.12993)]

Nate Veldt, Thomas Stanley, Benjamin W. Priest, Trevor Steil, Keita Iwabuchi, T.S. Jayram, Geoffrey Sanders

<br>

## Compiling

The C++ experiment code is built using CMake:
```
mkdir build
cd build
cmake ..
cmake --build .
```

Each distance function is compiled to a separate executable.

## Running

The commands used to run the tests for the papers are found below.

```
./jaccard -i data/cooking.txt -e 0 -o out/jaccard_cooking_0.txt -a all_out/jaccard_cooking_0.txt > logs/jaccard_cooking_0.log  
```

```
./hdf5_784_dim_euclidean -i data/fashion-mnist-784-euclidean.hdf5 -o out/hdf5_fashion.txt -a all_out/hdf5_fashion.txt > logs/hdf5_fashion.log
```


```
./hamming_distance -i data/gg_13_5_ssualign_filtered.txt -o out/hamming_gg.txt -a all_out/hamming_gg.txt > logs/hamming_gg.log 
```

```
./edit_distance -i data/US_filtered.txt -o out/edit_distance_names_us.txt -a all_out/edit_distance_names_us.txt > logs/edit_distance_names_us.log 
```

Output is generated in CSV format and contains results for both papers. The `RunType` column identifies which algorithm produced each row. A run type of `simple` indicates the algorithm used in the original paper.
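As a minimal sketch of how the `RunType` column might be used, the following Python groups the rows of an output file by run type. Only the `RunType` column name and the `simple` value come from the description above; the `Cost` column, the `other` run type, and the sample rows are hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict


def group_by_run_type(csv_file):
    """Group the rows of a CSV output file by their RunType column."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["RunType"]].append(row)
    return dict(groups)


# Hypothetical sample; real output files have their own columns and values.
sample = io.StringIO(
    "RunType,Cost\n"
    "simple,10.5\n"
    "other,9.8\n"
    "simple,11.0\n"
)
groups = group_by_run_type(sample)
print({run_type: len(rows) for run_type, rows in groups.items()})
```

To apply this to real results, pass an open file handle for one of the generated output files instead of the in-memory sample.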

Example outputs from running the programs on the datasets used in the ICLR 2026 paper can be found in the `results/multi_reps` folder. Scripts used to plot the figures in the ICLR 2026 paper can be found in the `plotting/multi_reps` folder. All plots used in the ICLR 2026 paper can be generated by running the following command in the `plotting/multi_reps` directory:
```
./plot_all.sh ../../results/multi_reps/out
```

Example outputs and plotting scripts for the ICML 2025 paper can be found in the `results/simple` and `plotting/simple` folders. Note that the output data format used in the ICML 2025 paper differs from the current output format, so the old data can only be used with the old scripts, and the old scripts cannot be used on the new data. This data is kept for archival purposes.

## Datasets

The preprocessed Cooking dataset can be found in the `data` directory. Each data object is a set of food ingredients defining a recipe. There are 6714 ingredients and 39774 recipes. The original dataset comes from the What’s Cooking? Kaggle challenge [[source](https://www.kaggle.com/c/whats-cooking)].
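The `jaccard` executable name suggests that these sets are compared with the Jaccard distance. For reference, here is a minimal Python sketch of the standard definition, 1 - |A ∩ B| / |A ∪ B|. It is not the repository's C++ implementation, and the ingredient names are made up for illustration.

```python
def jaccard_distance(a, b):
    """Standard Jaccard distance between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets are at distance 0.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical recipes, each represented as a set of ingredients.
recipe_1 = {"salt", "flour", "butter"}
recipe_2 = {"salt", "flour", "sugar", "eggs"}
d = jaccard_distance(recipe_1, recipe_2)
print(d)
```

The two toy recipes share 2 of 5 distinct ingredients, so their distance is 1 - 2/5.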

The Kosarak dataset is derived from click-stream data from a Hungarian news portal [[source](https://fimi.uantwerpen.be/data/)]. The original data comprises 990002 sets defined over a collection of 41270 items. We restricted the data to sets of size at least 40, leaving 32295 sets.

The MovieLens dataset is a collection of sets derived from movie ratings [[source](https://grouplens.org/datasets/movielens/)]. We consider a subset of the data provided on the [ANN Benchmarks Repository](https://github.com/erikbern/ann-benchmarks), restricting to sets with 64 items or more, in order to work with a dataset where n≈30000.
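The size thresholds described above (at least 40 items for Kosarak, at least 64 for MovieLens) amount to a simple filter. The sketch below is a hypothetical illustration rather than the preprocessing script actually used; it assumes one set per line with whitespace-separated item identifiers, and the toy input is invented.

```python
def filter_sets_by_size(lines, min_size):
    """Keep only the sets that contain at least min_size distinct items."""
    kept = []
    for line in lines:
        items = set(line.split())  # assumed format: whitespace-separated ids
        if len(items) >= min_size:
            kept.append(items)
    return kept


# Toy input with a small threshold; the thresholds in the text are 40 and 64.
toy_lines = ["1 2 3", "4 5", "6 7 8 9", ""]
kept = filter_sets_by_size(toy_lines, min_size=3)
print(len(kept))
```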

The FashionMNIST data can be obtained as an HDF5 file from [https://github.com/erikbern/ann-benchmarks](https://github.com/erikbern/ann-benchmarks).

The Names-US dataset comes from the name-dataset repository [https://github.com/philipperemy/name-dataset](https://github.com/philipperemy/name-dataset). The repository includes a link to the full dataset, which we downloaded. The last names in the Names-US subset of the larger dataset were preprocessed to generate a text file (`US_filtered.txt`) with one last name per line.
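These names are the input to the `edit_distance` executable. As a reference point, the following Python sketch computes the standard Levenshtein distance (unit-cost insertions, deletions, and substitutions). That this is the exact variant used by the C++ code is an assumption, and the example names are illustrative only.

```python
def levenshtein(s, t):
    """Levenshtein distance using a rolling dynamic-programming row."""
    if len(s) < len(t):
        s, t = t, s  # keep the DP row as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (cs != ct)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Illustrative names, not taken from the dataset.
d = levenshtein("Smith", "Smyth")
print(d)
```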

The original GreenGenes dataset was obtained by following the instructions at [https://www.drive5.com/usearch/benchmark_ggclust.html](https://www.drive5.com/usearch/benchmark_ggclust.html). This link is no longer active; the GreenGenes dataset can now be obtained at [https://greengenes.lbl.gov/Download/](https://greengenes.lbl.gov/Download/). The data was preprocessed into a text file in which each line contains one aligned sequence.
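Because the sequences are aligned, they can be compared position by position, which is what the `hamming_distance` executable name suggests. A minimal Python sketch of the Hamming distance follows. How the repository treats gap characters is not stated here, so this version simply counts any differing position, and the sequences are invented for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Invented aligned sequences; '-' is treated like any other symbol here.
seq_1 = "AC-GT"
seq_2 = "ACCGA"
d = hamming_distance(seq_1, seq_2)
print(d)
```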


