Feature Selection for Multiclass Binary Data

Kushani Perera, Jeffrey Chan, Shanika Karunasekera

Published: 2018, Last Modified: 11 Feb 2026, PAKDD (3) 2018, CC BY-SA 4.0

Abstract: Feature selection in binary datasets is an important task in many real-world machine learning applications such as document classification, genomic data analysis, and image recognition. Despite the many algorithms available, selecting features that distinguish all classes from one another in a multiclass binary dataset remains a challenge. Furthermore, many existing feature selection methods incur unnecessary computation costs on binary data because they are not specifically designed for it. We show that, by exploiting the symmetry and feature value imbalance of binary datasets, more efficient feature selection measures can be developed that better distinguish the classes in multiclass binary datasets. Using these measures, we propose a greedy feature selection algorithm, CovSkew, for multiclass binary data. We show that CovSkew achieves a high accuracy gain over baseline methods, up to \(\sim\)40%, especially when the selected feature subset is small. We also show that CovSkew has low computational costs compared with most of the baselines.