# The datasets used in this paper include:

- (1) Friedman data for regression. The generation function, also given in the experiment section, uses $200$ samples, $p^*=5$ informative features, and $p_u=95$ uninformative features following $\mathcal{N}(0,1)$. We also add $p_n=10$ noisy features following $\mathcal{N}(100,100)$ to better highlight robustness.
  With $\epsilon \sim \mathcal{N}(0,1)$ denoting Gaussian noise, the output $y$ is generated by
  ```
  y=f(X)+\epsilon=10 \sin \left(\pi X^{(1)} X^{(2)}\right)+20\left(X^{(3)}-0.5\right)^2+10 X^{(4)}+5 X^{(5)}+\epsilon.
  ```

- (2) Synthetic additive data for regression. 
  It involves $N=200$ samples, $p^*=8$ informative features, and $p_u=92$ uninformative features. We also add $p_n=10$ noisy features following $\mathcal{N}(100,100)$ to the dataset. The output is generated by
  ```
  Y=f^*(X)+\epsilon = \sum_{j=1}^{8} f^{(j)}(X^{(j)})+\epsilon,
  ```
  where $f^{(1)}(u) = -2 \sin(2u), \quad f^{(2)}(u)=8u^2, \quad f^{(3)}(u)=\frac{7\sin u}{2-\sin u}, \quad f^{(4)}(u)=6e^{-u}, \quad f^{(5)}(u)=u^3+\frac{3}{2}(u-1)^2, \quad f^{(6)}(u)=5u, \quad f^{(7)}(u)=10\sin(e^{-u/2}), \quad f^{(8)}(u)=-10\widetilde{\phi}\left(u,\frac{1}{2},\left(\frac{4}{5}\right)^2\right)$.
  Note that evaluating the additive models on a testing set requires generating the Gram matrices or the spline features for the testing inputs as well.

- (3) Synthetic additive data for classification. It involves $N=200$ samples, $p^*=2$ informative features, $p_u=98$ uninformative, redundant features following $\mathcal{N}(0,1)$, and $p_n=10$ noisy features following $\mathcal{N}(100,100)$. The decision function is
  ```
  f^*(x_i) = (x_{i}^{(1)}-0.5)^2 + (x_{i}^{(2)}-0.5)^2 -0.08,
  ```
  where $x_{i}^{(j)}=(W_{ij}+U_i)/2$, and $W_{ij}$ and $U_i$ are drawn independently from $U(0,1)$ for $i=1, \cdots, 200$ and $j=1, \cdots, 100$.
  The label is $y_i=0$ when $f^*(x_i) \le 0$ and $y_i=1$ otherwise.
  This synthetic classification data has been used in existing research to evaluate the performance of additive models.

- (4) Synthetic Moon data for classification. It involves two classes with a total of $200$ samples, $p^*=2$ informative features, $p_u$ uninformative, redundant features, and $p_n$ additional noisy features. This data has been widely used to evaluate a model's ability to correctly separate different categories.

- (5) Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset for regression. To better highlight robustness in real-world applications, we also consider the ADNI dataset (https://adni.loni.usc.edu/), which has 795 instances and $p=326$ features.

- Four datasets from the UCI repository for regression. 

  (6) Buzz prediction on the Twitter dataset for regression. It involves $N=38{,}393$ samples, $p^*=77$ original features, and $p_n=10$ additional noisy features. The task is to predict the mean number of active discussions.

  (7) Boston Housing Price dataset for regression. It involves only 506 samples, $p^*=13$ original features, and $p_n=10$ additional noisy features. This dataset has been widely used to evaluate regression models.

  (8) Ozone Level Detection dataset for regression. It includes $N=2536$ instances with $p^*=73$ attributes, and the task is to forecast ground ozone pollution from the given features. We also add $p_n=10$ noisy features to the original dataset.

  (9) SkillCraft Master dataset for regression. The dataset consists of $N=3395$ observations and $p^*=19$ input variables. We further add $p_n=10$ noisy features to the original dataset.

- Four datasets from the UCI repository for classification. 

  (10) Predicting Buzz Magnitude in the Social Media dataset for classification. It involves $N=38393$ instances with $p^*=77$ original features. We further add $p_n=10$ noisy features to the original dataset to compare the robustness of the baselines.

  (11) Breast Cancer Wisconsin dataset for classification. There are 569 instances and $p^*=29$ original input features. We further add $p_n=10$ noisy features following $\mathcal{N}(100,100)$ to the original dataset.

  (12) Phishing Websites dataset for classification. It contains 2456 observations and 31 columns: 30 features and one target.

  (13) Statlog (Heart) dataset for classification. It involves $N=270$ instances with $p^*=13$ input features. Noisy features are further added for comparison.

- Three image datasets for classification or regression.

  (14) The image data from the COIL20 image library, which initially contains 20 objects, is used for classification. For simplicity, the 12th and 13th objects are selected, giving $N=72$ instances per object and $p^*=16384$ original features (grayscale images of size $128\times 128$). This dataset has been used to evaluate the prediction performance of semi-supervised learning models for feature reduction.

  (15) CelebA-HQ images (https://www.modelscope.cn/datasets/OmniData/CelebA-HQ.git), which were derived from the original CelebA dataset, are used for classification. There are $N=30{,}000$ instances and $p^*=262{,}144$ original features (images of size $512\times 512$).

  (16) AgeDB (https://ibug.doc.ic.ac.uk/resources/agedb/) is a facial image dataset comprising $N>16{,}000$ high-quality facial images of 568 distinct subjects, with each subject represented across a significant age span (averaging 13.0 years between the youngest and oldest images per identity). All photos are standardized to a uniform resolution of $224\times 224$ pixels ($p^*=50{,}176$), ensuring consistency for model training and evaluation.
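
The synthetic sets in items (1) and (3) can be sketched with standard-library Python as below. This is a minimal illustration under stated assumptions, not the generation code used in the paper: the informative Friedman inputs are drawn from $U(0,1)$ (the usual convention for this benchmark, not stated above), the second parameter of $\mathcal{N}(100,100)$ is read as a variance (standard deviation 10), all 100 features of item (3) follow $x_{i}^{(j)}=(W_{ij}+U_i)/2$, and every function name and the seed are hypothetical.

```python
import math
import random


def noisy_features(rng, n, p_n=10, mean=100.0, var=100.0):
    """Draw p_n noisy columns per sample; N(100, 100) is read as (mean, variance)."""
    std = math.sqrt(var)  # assumption: the second parameter is a variance
    return [[rng.gauss(mean, std) for _ in range(p_n)] for _ in range(n)]


def friedman(rng, n=200, p_u=95):
    """Item (1): 5 informative features plus p_u uninformative ones."""
    X, y = [], []
    for _ in range(n):
        info = [rng.random() for _ in range(5)]            # assumption: U(0, 1)
        uninf = [rng.gauss(0.0, 1.0) for _ in range(p_u)]  # N(0, 1), as stated
        eps = rng.gauss(0.0, 1.0)                          # Gaussian noise
        target = (10 * math.sin(math.pi * info[0] * info[1])
                  + 20 * (info[2] - 0.5) ** 2
                  + 10 * info[3] + 5 * info[4] + eps)
        X.append(info + uninf)
        y.append(target)
    return X, y


def additive_classification(rng, n=200, p=100):
    """Item (3): x_ij = (W_ij + U_i) / 2; the label uses the first two features."""
    X, labels = [], []
    for _ in range(n):
        u = rng.random()                                   # U_i, shared by the row
        row = [(rng.random() + u) / 2 for _ in range(p)]   # W_ij drawn per entry
        f = (row[0] - 0.5) ** 2 + (row[1] - 0.5) ** 2 - 0.08
        X.append(row)
        labels.append(0 if f <= 0 else 1)
    return X, labels


def append_columns(X, extra):
    """Concatenate two row-major feature lists column-wise."""
    return [a + b for a, b in zip(X, extra)]


rng = random.Random(0)  # hypothetical seed, for reproducibility only
X_reg, y_reg = friedman(rng)
X_reg = append_columns(X_reg, noisy_features(rng, len(X_reg)))
X_cls, y_cls = additive_classification(rng)
X_cls = append_columns(X_cls, noisy_features(rng, len(X_cls)))
print(len(X_reg), len(X_reg[0]), len(X_cls), len(X_cls[0]))
```

Both feature matrices end up with 110 columns: the 100 original features followed by the 10 noisy ones. Appending the noisy columns last keeps the informative features at fixed indices, which makes it easy to check whether a feature-selection method recovers them.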

The above real-world datasets have undergone preliminary data cleaning: entries with missing values are filled with the corresponding mean values, and entries missing a substantial share of features (ratio of missing features $\geq20\%$) are removed.
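
A minimal sketch of this cleaning step is given below, using only the Python standard library. It rests on assumptions that the description leaves open: "entries" are treated as rows (samples), rows are filtered before the column means are computed, and missing values are represented as `None`. The function name `clean` and the fallback value for a fully missing column are hypothetical.

```python
def clean(rows, max_missing_ratio=0.2):
    """Drop rows missing >= max_missing_ratio of their features, then mean-impute."""
    p = len(rows[0])
    kept = [r for r in rows
            if sum(v is None for v in r) / p < max_missing_ratio]
    # Column means over the observed values of the kept rows.
    means = []
    for j in range(p):
        observed = [r[j] for r in kept if r[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in kept]


# Toy example with 6 features; None marks a missing value.
rows = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [3.0, None, 5.0, 6.0, 7.0, 8.0],     # 1/6 missing: kept and imputed
    [None, None, 1.0, 1.0, 1.0, 1.0],    # 2/6 missing (>= 20%): removed
    [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
]
cleaned = clean(rows)
print(len(cleaned), cleaned[1][1])  # 3 4.0
```

In the toy example, the missing value in the second row is replaced by the mean of the two observed values in that column, $(2+6)/2=4$.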