## Data sets

Each dataset is publicly available. All data is binarized; numerical features are quantized first. When this resulted in too many features (>100), we checked whether features could be dropped without information loss.

For each data set the last column is the label/target.
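As an illustration of the binarization described above, here is a minimal sketch. It assumes quantile binning for numerical features and one-hot encoding for categorical ones; the bin count `n_bins`, the quantile strategy, and all function names are hypothetical choices for this example, not the settings used to produce the published files.

```python
import numpy as np

def binarize_numeric(column, n_bins=4):
    """Quantize a numeric column into quantile bins and one-hot encode the bin index."""
    # Interior quantile edges; np.unique drops duplicate edges for low-cardinality columns.
    edges = np.unique(np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1]))
    bin_idx = np.digitize(column, edges)  # integer bin index in 0..len(edges)
    return np.eye(len(edges) + 1, dtype=int)[bin_idx]

def binarize_categorical(column):
    """One-hot encode a categorical column (one binary feature per distinct value)."""
    values, codes = np.unique(column, return_inverse=True)
    return np.eye(len(values), dtype=int)[codes]

def build_binary_table(numeric_cols, categorical_cols, labels, n_bins=4):
    """Stack all binary features and append the label as the last column."""
    blocks = [binarize_numeric(np.asarray(c, dtype=float), n_bins) for c in numeric_cols]
    blocks += [binarize_categorical(np.asarray(c)) for c in categorical_cols]
    return np.hstack(blocks + [np.asarray(labels, dtype=int).reshape(-1, 1)])

# Toy usage: one numeric and one categorical feature, label in the last column.
age = [21, 35, 50, 64]
sex = ["f", "m", "f", "m"]
y = [0, 1, 0, 1]
table = build_binary_table([age], [sex], y, n_bins=2)
```

Each original feature expands into several binary columns, which is why the feature count can grow quickly and why a later reduction step may be needed.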

For some large data sets we sampled down to a more manageable size. For the US Census datasets, we used the `folktables` Python library to retrieve and filter the data.
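The retrieval and down-sampling for the US Census tasks could look roughly like the sketch below. The `folktables` calls (`ACSDataSource`, `get_data`, `df_to_numpy`) are part of that library's public interface; the helper names `load_acs_task` and `subsample`, the uniform sampling without replacement, and the seed are assumptions made for illustration and may differ from what was actually run.

```python
import numpy as np

def load_acs_task(state, task_name="ACSEmployment", year="2018"):
    """Download one state's ACS person records and build a folktables task."""
    import folktables  # imported lazily: only needed when data is actually fetched
    source = folktables.ACSDataSource(survey_year=year, horizon="1-Year", survey="person")
    acs_data = source.get_data(states=[state], download=True)
    task = getattr(folktables, task_name)  # e.g. ACSEmployment or ACSPublicCoverage
    features, labels, _group = task.df_to_numpy(acs_data)
    return features, labels

def subsample(features, labels, fraction=0.25, seed=0):
    """Draw a uniform random subset of rows without replacement (assumed strategy)."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(fraction * len(labels)))
    keep = np.sort(rng.choice(len(labels), size=n_keep, replace=False))
    return features[keep], labels[keep]
```

For example, `subsample(*load_acs_task("CA"), fraction=0.25)` would fetch the California employment task and keep a quarter of its rows; this call needs network access and the `folktables` package installed.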

The data sets used are listed below, together with any adaptations we made:

* **Banana**: artificial 2D “banana-shaped” clusters (2 features: x & y). Repository: https://sci2s.ugr.es/keel/dataset.php?cod=182  
* **Breast_cancer**: Ljubljana oncology center data; predict recurrence for stages I–III. https://doi.org/10.24432/C51P4M  
* **Diabetes**: Pima Indian females ≥21; predict diabetes from diagnostic measurements. https://doi.org/10.17632/7zcc8v6hvp.1  
* **Solar_flare**: sunspot/solar-activity observations; predict occurrence of a flare type. http://dx.doi.org/10.5281/zenodo.18110  
* **German_credit**: bank customers labeled good/bad credit risk via attributes. https://doi.org/10.24432/C5NC77  
* **Image**: classify surface type of 3×3 pixel regions from 7 outdoor images (hand-segmented). https://doi.org/10.24432/C5GP4N  
* **Heart**: combined Cleveland, Hungary, Switzerland, Long Beach VA datasets; predict heart disease. https://doi.org/10.24432/C52P4X  
* **Ringnorm**: artificial; classify Gaussians `N(0, 4I)` vs. `N(μ, I)` with `μ = (a,…,a)`, `a = 1/√20`. http://dx.doi.org/10.5281/zenodo.18110  
* **Splice**: recognize exon/intron boundaries in DNA; binarized features reduced from 240 to 61. https://doi.org/10.24432/C5M888  
* **Thyroid**: detect hypothyroidism vs. healthy; Garavan Institute source. https://doi.org/10.24432/C5D010  
* **Titanic**: predict passenger survival (887 points, not the 24-point benchmark); binarized and reduced features from 333 to 94. http://dx.doi.org/10.5281/zenodo.18110  
* **Twonorm**: artificial; classify Gaussians with means `(a,…,a)` and `(-a,…,-a)` where `a = 2/√20`. http://dx.doi.org/10.5281/zenodo.18110  
* **Waveform**: artificial 40-attribute waveform data with noise; classes are convex combos of waveforms. http://dx.doi.org/10.5281/zenodo.18110  

* **Adult**: census data to predict income > \$50K/year. https://doi.org/10.24432/C5XW20  
* **public_coverage_CA2018**: via `folktables`; binarized, reduced from 622 to 57 features, sampled 25% of 2018 data.  
* **public_coverage_TX2018**: via `folktables`; binarized, reduced from 591 to 130 features, sampled 25% of 2018 data.  
* **employment_CA2018**: via `folktables`; standard employment task, sampled 25% of 2018 data.  
* **employment_TX2018**: via `folktables`; standard employment task, sampled 25% of 2018 data.  
* **Compas**: Broward County recidivism prediction; preprocessed to 14 features. https://www.kaggle.com/datasets/danofer/compass  
* **Secondary_mushroom**: simulated mushrooms (7.5× primary size); binarized features reduced from 111 to 63. https://doi.org/10.24432/C5FP5Q

The feature selection process uses a Random Forest classifier to rank features by their contribution to model accuracy. First, a subset of the data is reserved for feature selection, and its categorical and numerical features are preprocessed using one-hot encoding and binning, respectively. The Random Forest model is trained on this preprocessed subset, producing an importance score for each feature. Features are then added one at a time in order of decreasing importance, and accuracy is tracked at each step to identify the optimal number of features. The subset of features that achieves the highest accuracy is selected, and only these features are kept in the final dataset.
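The procedure above can be sketched with scikit-learn as follows. This is a minimal illustration, not the exact pipeline: the input is assumed to be already binarized, accuracy is assumed to be measured on a held-out 30% split of the selection subset, and the function name `select_features`, the number of trees, and the tie-breaking rule are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def select_features(X, y, seed=0):
    """Rank features by Random Forest importance, then keep the best-scoring prefix."""
    # Assumption: accuracy is measured on a held-out part of the selection subset.
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_fit, y_fit)
    ranking = np.argsort(forest.feature_importances_)[::-1]  # most important first

    best_acc, best_k = -1.0, 0
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]  # the k most important features
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        acc = model.fit(X_fit[:, cols], y_fit).score(X_val[:, cols], y_val)
        if acc > best_acc:  # strict '>' keeps the smallest k among ties
            best_acc, best_k = acc, k
    return np.sort(ranking[:best_k]), best_acc

# Toy usage: the label equals feature 0, the other five features are noise.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(300, 6))
y_demo = X_demo[:, 0]
selected, accuracy = select_features(X_demo, y_demo)
```

The returned column indices would then be used to slice the full binarized data set, with the label column kept last.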