# Better Learning-Augmented Spanning Tree Algorithms via Metric Forest Completion

Forked from the code for 'Approximate Forest Completion and Learning-Augmented Algorithms for Metric Minimum Spanning Trees', prior work of Veldt et al., available at [https://github.com/tommy1019/MetricForestCompletion](https://github.com/tommy1019/MetricForestCompletion).


The C++ experiment code is built using CMake:
```
mkdir build
cd build
cmake ..
cmake --build .
```

Each distance function is compiled to a separate executable. The commands used to run the tests from the paper are listed below.

```
./jaccard -i data/cooking.txt -e 0 -o out/jaccard_cooking_0.txt -a all_out/jaccard_cooking_0.txt > logs/jaccard_cooking_0.log  
```

```
./hdf5_784_dim_euclidean -i data/fashion-mnist-784-euclidean.hdf5 -o out/hdf5_fashion.txt -a all_out/hdf5_fashion.txt > logs/hdf5_fashion.log
```


```
./hamming_distance -i data/gg_13_5_ssualign_filtered.txt -o out/hamming_gg.txt -a all_out/hamming_gg.txt > logs/hamming_gg.log 
```

```
./edit_distance -i data/US_filtered.txt -o out/edit_distance_names_us.txt -a all_out/edit_distance_names_us.txt > logs/edit_distance_names_us.log 
```

Outputs on the datasets used in the paper can be found in the `results/multi_reps/out` folder. Plots used in the paper can be generated by running the `plot_all` script pointed at the `results/multi_reps/out/out` folder.

## Data

The cooking dataset is included under `data/cooking.txt`.

Instructions for obtaining the other datasets are as follows:

The FashionMNIST data can be obtained by loading the HDF5 file from [https://github.com/erikbern/ann-benchmarks](https://github.com/erikbern/ann-benchmarks).

The Names-US dataset comes from the name-dataset repository [https://github.com/philipperemy/name-dataset](https://github.com/philipperemy/name-dataset). The repository links to the full dataset, which was downloaded. The last names in the Names-US subset of the larger dataset were preprocessed to generate a text file (`US_filtered.txt`) with one last name per line.
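To make the role of this file concrete, here is a minimal sketch of an edit distance between two names, as they would appear on two lines of `US_filtered.txt`. It assumes the unit-cost Levenshtein distance (insert, delete, substitute each cost 1); the function name `edit_distance` is hypothetical and the code is not taken from this repository, so the executable's exact definition may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: unit-cost Levenshtein distance between two strings.
// Assumption: insertions, deletions, and substitutions all cost 1.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    // prev[j] holds the distance between the first i-1 chars of a
    // and the first j chars of b; cur is the row being filled in.
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;  // deleting the first i chars of a
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t substitute =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1,      // delete from a
                               cur[j - 1] + 1,   // insert into a
                               substitute});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

Keeping only two rows of the dynamic-programming table uses memory proportional to the shorter dimension, which is usually enough for short strings such as last names.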

The original GreenGenes dataset was obtained by following the instructions at [https://www.drive5.com/usearch/benchmark_ggclust.html](https://www.drive5.com/usearch/benchmark_ggclust.html). The data was preprocessed into a text file where each line of the text file includes one aligned sequence.
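Because every line holds one aligned sequence, two lines can be compared position by position. The sketch below illustrates a Hamming distance under two assumptions that are not confirmed by this repository: all sequences have the same length, and gap characters are treated like any other symbol. The function name `hamming_distance` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of positions at which two aligned
// sequences differ. Assumption: gap characters count as ordinary symbols.
std::size_t hamming_distance(const std::string& a, const std::string& b) {
    // Hamming distance is only defined for equal-length inputs.
    assert(a.size() == b.size());

    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++mismatches;
    }
    return mismatches;
}
```

For example, `hamming_distance("ACGT-", "ACCT-")` counts a single mismatch at the third position.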


