Abstract: Partitioning Around Medoids (PAM, k-medoids) is a popular clustering technique for use with arbitrary distance functions or similarities, where each cluster is represented by its most central object, called the medoid or the discrete median. In operations research, this family of problems is also known as the Facility Location Problem (FLP). FastPAM recently introduced a speedup for large k to make it applicable to larger problems, but the method still has a runtime quadratic in N. In this contribution, we discuss a sparse and asymmetric variant of this problem, which can be used on graph data such as road networks. By exploiting sparsity, we can avoid the quadratic runtime and memory requirements and make this method scalable to even larger problems, as long as we are able to build a small enough graph of sufficient connectivity to perform local optimization. Furthermore, we consider asymmetric cases, where the set of medoids is not identical to the set of points to be covered (or, in the facility location interpretation, where the possible facility locations are not identical to the consumer locations). Because of sparsity, it may be impossible to cover all points with just k medoids if k is too small, which would render the problem unsolvable and would break common heuristics for finding a good starting condition. Hence, we consider determining k as part of the optimization problem and propose to first construct a greedy initial solution with a larger k, then to optimize the problem by alternating between PAM-style “swap” operations, where the result is improved by replacing medoids with better alternatives, and “remove” operations that reduce the number of medoids k, until neither allows further improvement of the result quality. We demonstrate the usefulness of this method on a problem from electrical engineering, with the input graph derived from cartographic data.

Sensor measurements can be represented as points in ℝ^d. Ordered by the timestamps of these measurements, these points yield a time series that can be interpreted as a polygonal curve in the d-dimensional ambient space. The Fréchet distance is a popular dissimilarity measure for curves, in its continuous and discrete versions. These are the dissimilarity measures of choice when the inner structure of the curves is to be taken into account. One of their limitations is the inherent complexity of computing the Fréchet distance: it is believed that no algorithm can compute the Fréchet distance between two curves with m vertices each (m is called the complexity of the curve) in time that is subquadratic in m. Clustering is a fundamental computational task on curves. We consider clustering in the (metric) spaces equipped with the Fréchet distance. Research on k-clustering problems for curves with bounded complexity of the cluster centers was initiated by Driemel, Krivošija, and Sohler [185], whose results are limited to curves in a one-dimensional ambient space. These results started a series of publications, which we survey in the first part of this section. Related to k-clustering is the middle curve problem [12]. Buchin, Funk, and Krivošija [98] studied the computational complexity of this problem, based on the previous work by Buchin et al. [93, 95], and showed that the middle curve problem is NP-complete. This result is presented in the second part of this section.
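To make the alternating optimization from the first abstract above concrete, the following sketch illustrates a swap/remove loop for the sparse, asymmetric k-medoids setting. It is a minimal illustration, not the published algorithm: the names `cover`, `points`, `solution_cost`, `swap_and_remove`, and in particular the fixed per-medoid cost `facility_cost` (used here so that removing a medoid can count as an improvement) are assumptions of this sketch, and it presumes that the greedy initial solution already covers every point.

```python
import math

def solution_cost(cover, points, medoids, facility_cost):
    """Service cost plus an assumed fixed opening cost per medoid.

    cover[m] is a sparse dict {point: distance} listing only the points
    reachable from candidate medoid m; points not reachable from any chosen
    medoid incur an infinite penalty, keeping the search within feasible
    (fully covering) solutions.
    """
    nearest = {}
    for m in medoids:
        for p, d in cover[m].items():
            if d < nearest.get(p, math.inf):
                nearest[p] = d
    service = sum(nearest.get(p, math.inf) for p in points)
    return service + facility_cost * len(medoids)

def swap_and_remove(cover, points, medoids, facility_cost=1.0):
    """Alternate PAM-style swap and remove steps until neither improves."""
    medoids = set(medoids)  # assumed to cover all points initially
    cost = solution_cost(cover, points, medoids, facility_cost)
    improved = True
    while improved:
        improved = False
        # Swap phase: try replacing one medoid by a non-medoid candidate.
        for m in list(medoids):
            for c in cover:
                if c in medoids:
                    continue
                cand = (medoids - {m}) | {c}
                cand_cost = solution_cost(cover, points, cand, facility_cost)
                if cand_cost < cost:
                    medoids, cost, improved = cand, cand_cost, True
                    break
        # Remove phase: drop a medoid whenever that lowers the total cost.
        for m in list(medoids):
            cand = medoids - {m}
            cand_cost = solution_cost(cover, points, cand, facility_cost)
            if cand_cost < cost:
                medoids, cost, improved = cand, cand_cost, True
    return medoids, cost
```

In this sparse representation, `cover[m]` stores only the points actually reachable from candidate m, so memory and cost evaluations scale with the number of stored graph distances rather than with N².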
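For intuition about the quadratic barrier for the Fréchet distance discussed above, the standard dynamic program for the discrete variant (following Eiter and Mannila) fills an m × m table of pairwise vertex distances. The sketch below, with an illustrative Euclidean ground distance, is meant only to make that quadratic table explicit.

```python
import math

def discrete_frechet(P, Q, dist=math.dist):
    """Discrete Fréchet distance between polygonal curves P and Q
    (sequences of points), via the classic O(|P|*|Q|) dynamic program."""
    m, n = len(P), len(Q)
    dp = [[math.inf] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[i][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][j], d)
            else:
                dp[i][j] = max(min(dp[i - 1][j], dp[i][j - 1],
                                   dp[i - 1][j - 1]), d)
    return dp[m - 1][n - 1]

# Example: two short curves in the plane.
print(discrete_frechet([(0, 0), (1, 1), (2, 0)], [(0, 1), (2, 1)]))
```

For two curves with m vertices each, the two nested loops perform Θ(m²) work, which is exactly the cost that the conjectured lower bound says cannot be substantially avoided.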
Hierarchical Agglomerative Clustering (HAC) is arguably the earliest and most flexible clustering method, because it can be used with many distances, similarities, and various linkage strategies. It is often used when the number of clusters in the dataset is unknown and some sort of hierarchy in the data is plausible. Most algorithms for HAC operate on a full distance matrix and therefore require quadratic memory. The standard algorithm also has cubic runtime to produce a full hierarchy. Both memory and runtime are especially problematic in the context of embedded or otherwise very resource-constrained systems. In this section, we present how data aggregation with BETULA, a numerically stable version of the well-known BIRCH data aggregation algorithm, can be used to make HAC viable on systems with constrained resources with only small losses in clustering quality, and hence allow exploratory data analysis of very large datasets.

A natural strategy for dealing with big data is to compress it. Compression can be used as a preprocessing step, as known from dimensionality reduction tasks, or it can be used to identify underlying patterns in the data that capture its core information. Both learning tasks can be formulated as a matrix factorization. Here, we discuss those matrix factorizations that impose binary constraints on at least one of the factor matrices. Such factorizations are particularly relevant in the field of clustering, where the data is summarized by a set of groups, called clusters. Unfortunately, the optimization methods that are able to integrate binary constraints mostly work under one condition: exclusivity. For clustering applications, this entails that every observation belongs to exactly one cluster, which is unsuitable for many applications. We propose a versatile optimization method for matrix factorizations with binary constraints that does not require additional constraints such as exclusivity. Our method is based on the theory of proximal gradient descent and supports the use of GPUs. We show that our approach is able to discover meaningful clusters even in the presence of high levels of noise, on both synthetic and real-world data.
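The HAC-on-aggregates idea described above can be illustrated with a toy sketch: the data is first condensed into a small number of cluster features (count and mean), and the expensive hierarchical clustering then runs only on those features. Both functions below are illustrative stand-ins; in particular, the flat threshold scan in `aggregate` is not the actual BETULA/BIRCH tree (which is hierarchical and numerically stable), and the centroid-linkage HAC is deliberately naive.

```python
import numpy as np

def aggregate(points, threshold):
    """Greedy stand-in for BETULA/BIRCH aggregation: absorb each point into
    the nearest cluster feature (count, mean) if it lies within `threshold`,
    otherwise open a new feature."""
    counts, means = [], []
    for x in points:
        x = np.asarray(x, dtype=float)
        if means:
            d = np.linalg.norm(np.asarray(means) - x, axis=1)
            i = int(np.argmin(d))
            if d[i] <= threshold:
                means[i] = means[i] + (x - means[i]) / (counts[i] + 1)
                counts[i] += 1
                continue
        counts.append(1)
        means.append(x.copy())
    return counts, means

def hac_on_features(counts, means):
    """Naive centroid-linkage HAC on the aggregated features: repeatedly
    merge the two closest (count-weighted) centroids and record the
    dendrogram as (i, j, distance) triples."""
    counts = [float(c) for c in counts]
    means = [np.asarray(m, dtype=float) for m in means]
    active, merges = list(range(len(means))), []
    while len(active) > 1:
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                i, j = active[a], active[b]
                d = np.linalg.norm(means[i] - means[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        w = counts[i] + counts[j]
        means.append((counts[i] * means[i] + counts[j] * means[j]) / w)
        counts.append(w)
        active = [k for k in active if k not in (i, j)] + [len(means) - 1]
        merges.append((i, j, d))
    return merges
```

The point of the aggregation step is that the cubic-runtime, quadratic-memory HAC now operates on k cluster features instead of N raw points, with k chosen much smaller than N to fit the resource budget.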
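As a rough illustration of a proximal-gradient approach to binary-constrained factorization (not the method proposed in the work summarized above), the sketch below alternates gradient steps on the squared reconstruction error of D ≈ Y X with a proximal-style step that pushes the entries of the assignment factor Y toward {0, 1} without enforcing exclusivity, so a row of Y may end up with several ones. The penalty, step sizes, and final rounding are illustrative choices.

```python
import numpy as np

def binary_factorization(D, r, steps=500, lam=0.05, seed=0):
    """Illustrative proximal-gradient sketch for D ~ Y @ X with a relaxed
    binary assignment matrix Y (no exclusivity constraint) and a
    real-valued factor X of r cluster profiles."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    Y = rng.uniform(0.0, 1.0, size=(n, r))   # relaxed binary factor in [0,1]
    X = rng.standard_normal((r, d)) * 0.1    # real-valued cluster profiles

    def prox_binary(V, t):
        # Nudge every entry by t toward its nearer binary value and clip to
        # [0,1]; this approximates the prox of a penalty rewarding {0,1}.
        return np.clip(np.where(V < 0.5, V - t, V + t), 0.0, 1.0)

    for _ in range(steps):
        # Gradient + proximal step on Y, with a Lipschitz-style step size.
        ty = 1.0 / max(np.linalg.norm(X, 2) ** 2, 1e-6)
        Y = prox_binary(Y - ty * ((Y @ X - D) @ X.T), ty * lam)
        # Plain gradient step on the real-valued factor X.
        tx = 1.0 / max(np.linalg.norm(Y, 2) ** 2, 1e-6)
        X = X - tx * (Y.T @ (Y @ X - D))
    return (Y >= 0.5).astype(int), X
```

Because every operation in the loop is a dense matrix product or an elementwise map, the same iteration ports directly to a GPU array library, which is what makes this family of methods attractive for large data.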