Topological optimization of the ICLR acronym into 4 clusters

In this notebook, we show how a topological loss can be used to optimize a data set for four clusters.

We start by setting the working directory and importing the necessary libraries.

Load and view data

We load and view the data as follows.

Apply topological optimization to the embedding

We now show how we can use topological optimization to encourage the model underlying the data to consist of four clusters. As a topological loss, we will use the persistence of the fourth most prominent gap.
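To make the loss concrete, here is a minimal sketch in plain Python, not the implementation used in this notebook. It assumes a Vietoris-Rips filtration in which every connected component is born at 0 and dies at the length of the edge that merges it, so the finite 0-dimensional persistences are the edge lengths of a Euclidean minimum spanning tree. The names `h0_death_times` and `kth_gap_persistence` are hypothetical. Whether "fourth most prominent" corresponds to `k=4` or `k=3` depends on whether the never-dying component is counted, so `k` is left as a parameter.

```python
import math


def h0_death_times(points):
    """Return the finite H0 death times, sorted from largest to smallest.

    Assumption: these equal the edge lengths of a Euclidean minimum
    spanning tree, computed here with Prim's algorithm in O(n^2) time.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known edge from the tree to each point
    best[0] = 0.0
    deaths = []
    for step in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:
            deaths.append(best[u])  # length of the edge that merged u
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths, reverse=True)


def kth_gap_persistence(points, k):
    """Persistence of the k-th most prominent finite gap (hypothetical helper)."""
    return h0_death_times(points)[k - 1]


# Toy data: four tight pairs of points spaced far apart along a line.
toy = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0), (21, 0), (30, 0), (31, 0)]
print(h0_death_times(toy))
print(kth_gap_persistence(toy, 3))
```

On the toy data, three long edges separate the four pairs, so the three largest death times are large and the remaining ones are small. Increasing the chosen gap by moving the points is what pushes the data towards four separated clusters.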

We can now conduct the topological optimization as follows.

We see that ordinary topological optimization resulted in at least four more prominently separated clusters, but points from the same letter of the ICLR acronym are also fragmented across different clusters. Here again, we may use a sampling strategy to improve both the computational efficiency and the effectiveness of topological optimization, as we see below.
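The sampling idea can be sketched as follows, again in plain Python and not as the notebook's actual code. The sketch assumes the loss is evaluated on several random subsets of the data and averaged, and that the finite 0-dimensional persistences equal minimum spanning tree edge lengths. The function names and the parameters `sample_fraction` and `n_samples` are hypothetical.

```python
import math
import random


def mst_edge_lengths(points):
    """Euclidean MST edge lengths (Prim's algorithm), largest first."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known edge from the tree to each point
    best[0] = 0.0
    lengths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return sorted(lengths, reverse=True)


def sampled_gap_loss(points, k, sample_fraction, n_samples, rng):
    """Average the k-th largest gap over random subsets of the points.

    Each subset holds at least k + 1 points so that a k-th gap exists.
    """
    if len(points) < k + 1:
        raise ValueError("need at least k + 1 points")
    size = min(len(points), max(k + 1, int(sample_fraction * len(points))))
    total = 0.0
    for _ in range(n_samples):
        subset = rng.sample(points, size)  # sampling without replacement
        total += mst_edge_lengths(subset)[k - 1]
    return total / n_samples


# Toy data: four tight pairs of points spaced far apart along a line.
toy = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0), (21, 0), (30, 0), (31, 0)]
rng = random.Random(0)  # fixed seed so the run is reproducible
print(sampled_gap_loss(toy, k=3, sample_fraction=0.5, n_samples=10, rng=rng))
```

Each subset is smaller than the full data set, so each persistence computation is cheaper, and averaging over subsets means the loss no longer depends on a handful of individual points.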

Note that the sampling strategy does not guarantee that the points in the four different clusters will remain clustered together. Indeed, the topological loss function does not care whether neighboring points remain close to each other, as long as it reaches at least four clusters. Hence, we can also see that fragmentation starts to occur in the current case, which will likely worsen with further epochs. As seen in the main paper, this can be effectively resolved through topological regularization.