Contributions to unsupervised learning from massive high-dimensional data streams: structuring, hashing and clustering. (Contributions à l'apprentissage non supervisé à partir de flux de données massives en grande dimension: structuration, hashing et clustering).

Anne Morvan

2018 (modified: 09 Nov 2022)

Abstract: This thesis focuses on how to efficiently perform unsupervised machine learning tasks, such as the fundamentally linked nearest neighbor search and clustering tasks, under time and space constraints for high-dimensional datasets. First, a new theoretical framework reduces the space cost and increases the throughput of data-independent Cross-polytope LSH for approximate nearest neighbor search with almost no loss of accuracy. Second, a novel streaming data-dependent method is designed to learn compact binary codes from high-dimensional data points in only one pass. Besides some theoretical guarantees, the quality of the obtained embeddings is assessed on the approximate nearest neighbor search task. Finally, a space-efficient parameter-free clustering algorithm is conceived, based on the recovery of an approximate Minimum Spanning Tree of the sketched data dissimilarity graph, on which suitable cuts are performed.
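For readers unfamiliar with Cross-polytope LSH, the following is a minimal sketch of the standard hash function only: a vector is rotated pseudo-randomly and then mapped to the closest signed basis vector. It is not the thesis's framework. The dense Gaussian matrix used as the rotation is an assumption made for simplicity, and the names `make_rotation` and `cross_polytope_hash` are hypothetical.

```python
import random


def make_rotation(dim, seed=0):
    """Build a dense Gaussian matrix used as a pseudo-random rotation.

    Assumption: a plain dim x dim Gaussian matrix, chosen for clarity.
    It costs O(dim^2) space, which is the kind of cost the abstract says
    the thesis reduces; the thesis's own construction is not shown here.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def cross_polytope_hash(x, rotation):
    """Return a bucket index in [0, 2 * dim) for the vector x.

    The bucket is the signed standard basis vector closest to the rotated
    vector, i.e. the coordinate with the largest absolute value plus its sign.
    Normalizing x is unnecessary because scaling does not change the argmax.
    """
    projected = [sum(r * v for r, v in zip(row, x)) for row in rotation]
    best = max(range(len(projected)), key=lambda i: abs(projected[i]))
    # Buckets 0..dim-1 stand for +e_i, buckets dim..2*dim-1 stand for -e_i.
    return best if projected[best] >= 0 else best + len(projected)


if __name__ == "__main__":
    rotation = make_rotation(8, seed=1)
    point = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1, 0.9, -0.4]
    print(cross_polytope_hash(point, rotation))
```

Nearby points tend to share a bucket, so candidate neighbors can be retrieved by looking up the query's bucket, typically across several independent rotations.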
