Bootstrap Sampling Improves Model Soup Performance via Increased Model Diversity for Pneumonia Classification

Published: 09 May 2026, Last Modified: 16 May 2026 | MIDL 2026 - Short Papers Poster | CC BY 4.0
Keywords: model soups, weight averaging, ensemble learning, bootstrap resampling, pneumonia classification
TL;DR: We show that bootstrap resampling improves greedy model soups for CNN-based pneumonia classification on ChestX-ray14 by increasing checkpoint diversity, enabling weight averaging to outperform homogeneous model pools.
Registration Requirement: Yes
Abstract: Model soups combine multiple trained neural network checkpoints through weight averaging, often outperforming individual models and achieving performance comparable to deep ensembles without increasing inference cost. However, their effectiveness depends critically on checkpoint diversity, and when models are trained on the same dataset, optimization trajectories may converge toward similar regions of parameter space, limiting this diversity. In this work, we investigate bootstrap resampling as a simple data-level mechanism for increasing checkpoint diversity. Using a binary pneumonia classification task and 644 radiographs from the National Institutes of Health (NIH) ChestX-ray14 dataset, we train pools of convolutional neural networks under varying bootstrap ratios and construct greedy model soups. While checkpoint models trained on the full dataset achieve the highest mean individual accuracy, they are highly similar and offer little complementary signal, limiting the effectiveness of greedy selection. Bootstrap sampling introduces variability in the training data, producing more diverse checkpoints that, although individually weaker, enable greedy soup construction to combine complementary representations and achieve superior overall performance. The strongest model soup, obtained with 70% bootstrap sampling, achieves a test accuracy of 0.650, representing a 9.8 percentage point improvement over the mean individual checkpoint accuracy (0.551) under the same condition. While absolute performance is limited by the small cohort size and training-from-scratch setting, this result highlights the substantial gains achievable through diversity-driven weight averaging.
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 115
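Illustrative sketch: The two ingredients named in the abstract, bootstrap resampling of the training set and greedy soup construction by weight averaging, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that a "bootstrap ratio" means drawing that fraction of the training set with replacement, that checkpoints are plain dictionaries mapping parameter names to lists of floats, and that a checkpoint is added to the soup only when the averaged weights score at least as well on held-out data. The names `bootstrap_indices`, `average_weights`, `greedy_soup`, and `evaluate` are hypothetical.

```python
import random


def bootstrap_indices(n_samples, ratio, rng):
    """Draw round(ratio * n_samples) training indices with replacement.

    Assumption: this is how a "bootstrap ratio" is interpreted here.
    """
    size = max(1, round(ratio * n_samples))
    return [rng.randrange(n_samples) for _ in range(size)]


def average_weights(state_dicts):
    """Uniformly average parameter dicts of the form {name: [float, ...]}."""
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }


def greedy_soup(checkpoints, evaluate):
    """Build a greedy soup from a pool of checkpoints.

    `evaluate` is a hypothetical callable returning a held-out score
    (higher is better) for a parameter dict.
    """
    # Start from the individually best checkpoint.
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    members = [ranked[0]]
    best_score = evaluate(ranked[0])

    # Try each remaining checkpoint; keep it only if the averaged
    # weights do not score worse than the current soup.
    for candidate in ranked[1:]:
        trial_score = evaluate(average_weights(members + [candidate]))
        if trial_score >= best_score:
            members.append(candidate)
            best_score = trial_score

    return average_weights(members), best_score, len(members)


# Toy usage: the score is the negative squared distance to the origin,
# so two checkpoints on opposite sides average to a better solution.
toy_pool = [{"w": [1.0, 0.0]}, {"w": [-1.0, 0.0]}, {"w": [5.0, 5.0]}]
toy_score = lambda sd: -sum(w * w for w in sd["w"])
soup, soup_score, n_members = greedy_soup(toy_pool, toy_score)
```

In the toy pool, the first two checkpoints are individually mediocre but complementary, which mirrors the abstract's argument that diverse checkpoints can average to something better than any single member. In a real pipeline each checkpoint would come from training on the subset selected by `bootstrap_indices`, and `evaluate` would measure validation accuracy.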