Keywords: Feature selection, contrastive analysis, computational biology
Abstract: The goal of unsupervised feature selection is to select a small number of informative features for use in unknown downstream tasks. What counts as "informative" is subjective and depends on the specifics of a given problem domain. In the contrastive analysis (CA) setting, machine learning practitioners are specifically interested in discovering patterns that are enriched in a target dataset as compared to a background dataset generated from sources of variation irrelevant to the task at hand. For example, a biomedical data analyst may wish to find a small set of genes to use as a proxy for variations in genomic data that are present only among patients with a given disease, as opposed to healthy control subjects. However, the problem of unsupervised feature selection in the CA setting has so far received little attention from the machine learning community. In this work we present CFS (Contrastive Feature Selection), a method for performing feature selection in the CA setting. We experiment with multiple variations of our method on a semi-synthetic dataset and four real-world biomedical datasets, and we find that it consistently outperforms previous state-of-the-art methods designed for standard unsupervised feature selection scenarios.
One-sentence Summary: We select features better suited to distinguishing between subclasses of a target dataset, where those subclasses are determined specifically by variations that are enriched relative to a background dataset.
Supplementary Material: zip