# Stable coreset

## Abstract
As deep learning models continue to scale, the growing computational demands have amplified the need for effective coreset selection techniques. Coreset selection aims to accelerate training by identifying small, representative subsets of data that approximate the performance of the full dataset. Among various approaches, gradient-based methods stand out due to their strong theoretical underpinnings and practical benefits, particularly under limited data budgets. However, these methods face challenges such as naïve stochastic gradient descent (SGD) acting as a surprisingly strong baseline and the breakdown of representativeness due to loss curvature mismatches over time.

In this work, we propose a novel framework that addresses these limitations. First, we establish a connection between posterior sampling and loss landscapes, enabling robust coreset selection even in high-data-corruption scenarios. Second, we introduce a smoothed loss function based on posterior sampling over the model weights, enhancing stability and generalization while maintaining computational efficiency. We also present a novel convergence analysis for our sampling-based coreset selection method. Finally, through extensive experiments, we demonstrate that our approach trains faster and generalizes better than the current state of the art across diverse datasets.
## Usage
```
python train.py
```

`--dataset`: The dataset to use. (default: `cifar10`)
- `cifar10`: CIFAR-10 dataset
- `cifar100`: CIFAR-100 dataset
- `tinyimagenet`: TinyImageNet dataset
- `imagenet`: ImageNet dataset

`--data_dir`: The directory to store the dataset. (default: `./data`)

`--arch`: The model architecture to use. (default: `resnet20`)
- `resnet20`: ResNet-20 model for CIFAR-10
- `resnet18`: ResNet-18 model for CIFAR-100
- `resnet50`: ResNet-50 model for TinyImageNet
- `roberta`: RoBERTa model for SNLI
- `lenet`: LeNet model for MNIST and EMNIST

`--selection_method`: The data selection method to use. (default: `random`)

`--train_frac`: The fraction of training steps to use compared to full training. (default: `0.1`)
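
The flags above can be combined in a single invocation. The helper below is a minimal sketch of how such a command line might be assembled; `build_command` and `DEFAULT_ARCH` are hypothetical names used for illustration and are not part of this repository. The dataset/architecture pairings follow the list above, and `random` is used for `--selection_method` because it is the only value documented here.

```python
import shlex

# Dataset/architecture pairings as listed in this README.
# Illustrative only; this mapping is not taken from train.py.
DEFAULT_ARCH = {
    "cifar10": "resnet20",
    "cifar100": "resnet18",
    "tinyimagenet": "resnet50",
}


def build_command(dataset="cifar10", selection_method="random",
                  train_frac=0.1, data_dir="./data", arch=None):
    """Assemble a `python train.py` argument list from the documented flags."""
    if arch is None:
        # Fall back to the pairing suggested by the README.
        arch = DEFAULT_ARCH[dataset]
    return [
        "python", "train.py",
        "--dataset", dataset,
        "--data_dir", data_dir,
        "--arch", arch,
        "--selection_method", selection_method,
        "--train_frac", str(train_frac),
    ]


if __name__ == "__main__":
    # Print a shell-ready command for CIFAR-100 with 20% of the training steps.
    print(shlex.join(build_command("cifar100", train_frac=0.2)))
```

The printed string can be pasted into a shell, or the list can be passed directly to `subprocess.run`.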

To run all experiments, use the following:
```
python python_job_submit.py
```
You need to change the settings in that file to fit your machine or cluster.
## Acknowledgement
The code is based on [CREST](https://github.com/BigML-CS-UCLA/CREST) and [AdaHessian](https://github.com/amirgholami/adahessian).
