Estimating class separability of text embeddings with persistent homology.

Kostis Gourgoulias; Najah Ghalyan; Maxime Labonne; yash satsangi; Sean Moran; Joseph Sabelja

Published: 14 Jun 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0.

Abstract: This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can inform about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated across binary and multi-class text classification tasks, show that the proposed method’s estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these quantities can provide additional insights into the properties of the trained classifier.

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: Camera-ready version.
- Added author names.
- Updated the "organization" paragraph to point to Section 5, which discusses computational complexity.
- Split the "persistence score" paragraph (pages 6-7), which discusses the statistic and its limit properties and motivation, so that it is easier to follow.
- Fixed minor typos.
- Added supplementary material with code that can reproduce the experiments.
- Added a company disclaimer after the conclusions (page 12).

Supplementary Material: zip

Assigned Action Editor: ~Yu_Meng1

Submission Number: 2318
