Abstract: Density-based mode-seeking methods generate a density-ascending dependency from low-density points towards higher-density neighbors.
Current mode-seeking methods identify modes by breaking some dependency connections, but they rely heavily on local data characteristics and therefore require case-by-case threshold settings or human intervention to be effective on different datasets. To address this issue, we introduce a novel concept called typicality, which explores the locally defined dependency from a global perspective to quantify how confidently a point can be regarded as a mode. We devise an algorithm that effectively and efficiently identifies modes with the help of the global-view typicality. To implement and validate our idea, we design a clustering method called TANGO, which not only leverages typicality to detect modes but also utilizes graph-cut with an improved path-based similarity to aggregate data into the final clusters. Moreover, this paper provides some theoretical analysis of the proposed algorithm. Experimental results on several synthetic datasets and an extensive set of real-world datasets demonstrate the effectiveness and superiority of TANGO. The code is available at https://github.com/SWJTU-ML/TANGO_code.
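As a rough illustration of the density-ascending dependency described in the abstract, the following minimal sketch links each point to its nearest higher-density neighbor. It shows only the generic mode-seeking construction, not TANGO's typicality or graph-cut steps. The k-nearest-neighbor density estimate, the parameter `k`, and the function names `density_ascending_dependency` and `follow_to_root` are assumptions made for this example and are not taken from the paper or its code.

```python
import numpy as np

def density_ascending_dependency(X, k=5):
    """Link each point to its nearest higher-density neighbor (illustrative only).

    Assumption: density is estimated as the inverse of the mean distance
    to the k nearest neighbors; the paper may use a different estimator.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances; the diagonal is masked so a point
    # is never counted as its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors, sorted from nearest to farthest.
    order = np.argsort(dist, axis=1)[:, :k]
    knn_dist = np.take_along_axis(dist, order, axis=1)
    density = 1.0 / (knn_dist.mean(axis=1) + 1e-12)
    # parent[i] == i marks a root, i.e. a candidate mode with no
    # higher-density point among its k nearest neighbors.
    parent = np.arange(n)
    for i in range(n):
        for j in order[i]:
            if density[j] > density[i]:
                parent[i] = j
                break
    return density, parent

def follow_to_root(parent, i):
    """Follow the dependency chain from point i up to its root."""
    while parent[i] != i:
        i = parent[i]
    return i
```

Because every link points to a strictly denser point, the chains cannot form cycles and each one ends at a root. In this sketch every root is treated as a candidate mode; deciding which candidates are genuine modes is the step that the abstract attributes to typicality.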
Lay Summary: When computers partition similar data into groups (like categorizing customer profiles or medical images), current methods spot these groups by first finding “representatives” of the groups and then assigning each remaining data point to the group of its most similar representative. These methods usually require manual adjustments for different data distributions, much like needing to recalibrate a machine for every product, because they only look at small, nearby details.
We developed a new concept called "typicality" that evaluates how likely a data sample is to be a representative by analyzing both local and global patterns in the dataset. Our TANGO algorithm then uses this insight to automatically identify representatives, followed by a technique called "graph-cut" to group these representatives and their affiliated data into final partitions.
This eliminates the need for case-by-case calibration. Tests on real-world datasets show that TANGO outperforms existing methods while remaining computationally efficient. TANGO makes computer-based data partitioning more realistic and reliable for diverse applications, from spotting disease patterns in medical scans to grouping similar products in e-commerce.
Link To Code: https://github.com/SWJTU-ML/TANGO_code
Primary Area: General Machine Learning->Clustering
Keywords: clustering, density-based clustering, mode-seeking, spectral clustering
Submission Number: 9950