Stray Intrusive Outliers-Based Feature Selection on Intra-Class Asymmetric Instance Distribution or Multiple High-Density Clusters
TL;DR: This paper proposes the stray intrusive outliers-based feature selection method for high-dimensional data classification with intra-class asymmetric instance distribution or multiple high-density clusters.
Abstract: For data with intra-class Asymmetric instance Distribution or Multiple High-density Clusters (ADMHC), outliers are real and carry specific patterns for data classification, and the class body is both necessary and difficult to identify. Previous Feature Selection (FS) methods score features based on all training instances or rarely target intra-class ADMHC. In this paper, we propose a supervised FS method, Stray Intrusive Outliers-based FS (SIOFS), for data classification with intra-class ADMHC. By focusing on Stray Intrusive Outliers (SIOs), SIOFS modifies the skewness coefficient and fuses the threshold in the 3$\sigma$ principle to identify the class body, scoring features based on the intrusion degree of SIOs. In addition, the refined density-mean center is proposed to represent the general characteristics of the class body reasonably. Mathematical formulations, proofs, and logical exposition support the rationality and universality of the settings in the proposed SIOFS method. Extensive experiments on 16 diverse benchmark datasets demonstrate the superiority of SIOFS over 12 state-of-the-art FS methods in terms of classification accuracy, normalized mutual information, and confusion matrix. The SIOFS source code is available at https://github.com/XXXly/2025-ICML-SIOFS
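For readers unfamiliar with the two statistical ingredients named in the abstract, the following is a minimal sketch of the textbook versions only: the standard moment-based skewness coefficient and the plain 3$\sigma$ rule applied to one feature of one class. It is not the SIOFS procedure; the paper's modified skewness coefficient, fused threshold, intrusion-degree score, and refined density-mean center are not reproduced here. The function names `skewness` and `three_sigma_body` and the parameter `k` are hypothetical, chosen for illustration.

```python
import math


def skewness(values):
    """Standard moment-based skewness (third central moment / std^3).

    Assumption: this is the unmodified textbook coefficient, not the
    modified coefficient proposed in the paper.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant feature: no asymmetry to measure
    return m3 / m2 ** 1.5


def three_sigma_body(values, k=3.0):
    """Split one class's feature values with the plain k-sigma rule.

    Returns (body, outliers): values inside and outside mean +/- k*std.
    Assumption: symmetric bounds; the paper fuses this threshold with
    skewness information, which this sketch does not attempt.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    lo, hi = mean - k * std, mean + k * std
    body = [v for v in values if lo <= v <= hi]
    outliers = [v for v in values if v < lo or v > hi]
    return body, outliers


if __name__ == "__main__":
    # Illustrative data: a tight cluster plus one far-away instance.
    feature = [1.0] * 20 + [100.0]
    body, outliers = three_sigma_body(feature)
    print("skewness:", skewness(feature))
    print("body size:", len(body), "outliers:", outliers)
```

With a right-skewed class like the one above, the symmetric 3$\sigma$ bounds extend far to the left where no instances lie, which gives some intuition for why a skewness-aware threshold may be preferable when the intra-class distribution is asymmetric.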
Lay Summary: In many real-world datasets, such as images or medical records, data within the same class can have complex patterns, like uneven spreads or multiple dense clusters, making it hard to distinguish between classes. Some data points, called stray outliers, look more like another class (e.g., a resort image mistaken for a school, or handwritten digits 4 and 9 appearing similar). Traditional feature selection (FS) methods treat all data points equally, ignoring these critical outliers. This paper introduces a new FS method, SIOFS, which focuses on the stray outliers that intrude into the bodies of other classes. SIOFS identifies the main characteristics of each class using a refined statistical approach, which helps it find the features that best separate classes. In tests on 16 diverse datasets, SIOFS outperformed 12 existing FS methods in accuracy and reliability. This advance is particularly useful for small or complex datasets where outliers and overlapping classes are common. The paper offers a way to mine the patterns of tricky data, improving automated classification in fields like healthcare or image recognition.
Link To Code: https://github.com/XXXly/2025-ICML-SIOFS
Primary Area: General Machine Learning->Supervised Learning
Keywords: Feature selection, stray intrusive outliers, refined density-mean center, intra-class asymmetric instance distribution or multiple high-density clusters, data classification
Submission Number: 1319