Adaptive Training Distributions with Scalable Online Bilevel Optimization

TMLR Paper 2600 Authors

30 Apr 2024 (modified: 14 May 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at training points which are likely to most improve the loss on the targeted distribution. Empirically, we show that in some cases this approach is beneficial over existing strategies from the domain adaptation literature but may not succeed in other cases. We propose a simple test to evaluate when our approach can be expected to work well and point towards further research to address current limitations.
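As an illustration of the selection principle described in the abstract, the following is a minimal sketch of gradient-alignment-based reweighting: training points whose gradients align with the gradient of the loss on the small targeted sample receive higher weight. This is an assumption-laden illustration, not the paper's algorithm (which is an online bilevel method with amortization); all names (`model`, `loss_fn`, the batch variables) are placeholders.

```python
# Illustrative sketch only (not the paper's code): weight each training example by
# how well its gradient aligns with the gradient of the loss on a small sample
# drawn from the targeted distribution. Assumes every parameter receives a gradient.
import torch

def alignment_weights(model, loss_fn, train_batch, target_batch):
    # Gradient of the mean loss on the small targeted sample.
    x_t, y_t = target_batch
    model.zero_grad()
    loss_fn(model(x_t), y_t).backward()
    g_target = torch.cat([p.grad.detach().flatten() for p in model.parameters()])

    # Naive per-example gradients on the training batch (a loop, for clarity only).
    x, y = train_batch
    scores = []
    for i in range(x.shape[0]):
        model.zero_grad()
        loss_fn(model(x[i:i + 1]), y[i:i + 1]).backward()
        g_i = torch.cat([p.grad.detach().flatten() for p in model.parameters()])
        scores.append(torch.dot(g_i, g_target))  # alignment with the target gradient
    model.zero_grad()

    w = torch.relu(torch.stack(scores))          # keep positively aligned points
    return w / (w.sum() + 1e-8)                  # normalized selection weights
```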
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=SiUzyvAkAg
Changes Since Last Submission: All major changes to the paper are written in blue in the new PDF.

The AE made the following comments:

> Motivation. Reviewer tbsW pointed out that LLMs might have the ability to transfer zero-shot to a wide range of tasks with different test distributions. It is unclear whether the problem this paper solves is relevant.

We added several sentences to highlight the benefits of data selection.
- We recall that the process of data selection, often performed with ad-hoc methods, is widely regarded as critical, and we explain that our work complements this view (paragraph *Optimization-based Filtering*).
- We added training curves in Fig. 1, which clearly show that our method allows for much faster training on the downstream specific distribution.

> Several reviewers felt that the problem setting was a bit impractical, where only a single downstream task is already known. They would suggest experiments in real-world scenarios.

We have added a multi-task experiment in Appendix B, where we want to perform well on several downstream tasks. Once again, the proposed methods perform favorably.

> More importantly, the authors still did not provide a rebuttal and revision of the paper to address reviewers' concerns.

The new experiments took a long time to run.

The reviewers made the following comments:

## Reviewer tbsW

> My main concern is about the motivation of this work.

See the response to the AE's first point.

> The scale of experiments (model and dataset size) is very limited.

We acknowledge that our methods are not applied to what is nowadays considered "large" models. However:
- Our experiments are a significant step towards scaling gradient-based data selection. [Ren et al. 2018] conduct experiments on MNIST with a LeNet and on CIFAR-10 with ResNet-32 and WideResNet-28-10 models. [Shu et al. 2019] conduct experiments with ResNet-32 on CIFAR-10 and CIFAR-100, and fine-tune a ResNet-50 on the Clothing1M dataset.
- The scaling ablation (Sec. 6.4) suggests that the proposed methods would be well suited for large-scale models, and we plan to conduct experiments on such models in the future. We have added a sentence about this in the conclusion.

We have also added a multi-task experiment in Appendix B, which might be closer to real-world scenarios.

> The performance of the proposed method is not consistently better than existing ones.

Our work evaluates gradient-based selection in various experimental settings, at a larger scale than reported before. Not all results are positive, and we are explicit about it. We even propose an applicability test to help practitioners decide whether to consider gradient-based selection in their application. We believe that analyzing the applicability of methods and identifying their limitations makes our study helpful to the community, complementing previous work with more limited experiments [Ren et al. 2018, Shu et al. 2019].

## Reviewer sdVz

> I feel the setting is a little impractical where only a single downstream task is already known at the beginning of learning. I fail to see how the proposed framework could easily be adapted into a multi-task version.

We added a multi-task experiment in Appendix B.

> The paper's novelty is not significant.

Indeed, our work blends together many existing techniques. Most of these techniques have been tested in smaller settings. Making these methods scale required novel ideas such as amortization with a neural network (Sec. 4) and the big-batch trick (Sec. 4.2); a sketch of this selection step is given after this response. The practical improvements over other baselines are clearly demonstrated in the experiments: our methods work significantly better than other data-selection techniques on the LM tasks. The benefit is not always large, which we honestly report. We go further and propose a simple test to verify when the method applies. Therefore, (1) larger experiments, (2) modifications for scaling, (3) exploration of applicability conditions, and (4) the applicability test are novel contributions which we believe are interesting to the community and will help future work on gradient-based selection start on a better footing.

> The proposed method's performance is not significant.

See the response to Reviewer tbsW.

## Reviewer LxYe

> The idea of sample weighting is not that novel; the gradient alignment idea is quite similar to that in [2].

We thank the reviewer for the interesting references, which we add to the discussion. The idea of sample weighting is indeed not novel, as we discuss in the related work section.

> The experimental results are reported without variance or standard deviation.

We report the variance on the LM experiment with 5 runs. Doing it for each experiment would be outside of our computational budget.

> No training details or code or hyper-parameter settings were provided for the reproducibility of the results.

We now provide a table in the appendix with all the hyper-parameters used in each experiment.
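As referenced in the response on novelty above, here is a hedged sketch of how an amortized scorer combined with a big-batch selection step could look: a cheap auxiliary network ranks a large candidate batch, and expensive model gradients are spent only on the top-scoring examples. The scorer architecture, its input features, and the top-k rule are assumptions for illustration, not the paper's implementation.

```python
# Hedged illustration (assumptions, not the paper's code): score a large candidate
# batch with a small amortized network, then train the model only on the subset
# with the highest predicted usefulness for the targeted distribution.
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """Cheap network mapping per-example features to a selection score."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)

def selected_train_step(model, scorer, optimizer, loss_fn, big_batch, feats, k=64):
    x, y = big_batch                                 # large candidate batch
    with torch.no_grad():
        scores = scorer(feats)                       # cheap scoring pass over all candidates
    idx = scores.topk(min(k, x.shape[0])).indices    # keep the k most promising examples
    optimizer.zero_grad()
    loss = loss_fn(model(x[idx]), y[idx])            # expensive gradients on the subset only
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's bilevel setting the scorer itself would also be updated from the loss on the targeted sample; that outer update is omitted from this sketch.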
Assigned Action Editor: ~changjian_shui1
Submission Number: 2600