Accelerated Training on Low-Power Edge Devices

TMLR Paper 5282 Authors

03 Jul 2025 (modified: 21 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Training on edge devices poses several challenges, as these devices are generally resource-constrained, especially in terms of power. State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints, leading to a significant increase in training time. To accelerate training, we propose to jointly adjust the system and application parameters (in our case, the GPU frequency and the batch size of the training task) while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch size efficiency with device profiling to achieve the desired optimization. Our evaluation on real hardware shows that our method outperforms the current baselines that rely on state-of-the-art techniques, reducing the training time by up to $2.3\times$ and achieving results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for training. These gains are achieved without reducing the performance of the trained model.
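To make the joint selection concrete, the sketch below illustrates one way such a cross-layer search could look: enumerate (GPU frequency, batch size) pairs, discard those whose profiled power draw exceeds the cap, and pick the pair with the lowest predicted training time. This is a minimal illustration only, not the authors' implementation: the lookup tables `power_lut` and `throughput_lut` (assumed to come from device profiling) and the batch-size efficiency predictor `steps_to_converge` are hypothetical placeholders for the profiling and prediction components described in the paper.

```python
def select_config(freqs, batch_sizes, power_lut, throughput_lut,
                  steps_to_converge, p_max):
    """Illustrative sketch: return the (frequency, batch size) pair with the
    lowest predicted training time among configurations respecting the power
    cap p_max. All inputs are hypothetical stand-ins for profiled data."""
    best, best_time = None, float("inf")
    for f in freqs:
        for bs in batch_sizes:
            # Skip configurations whose profiled power exceeds the constraint.
            if power_lut[(f, bs)] > p_max:
                continue
            # Predicted wall-clock time:
            # (steps needed to converge at this batch size) * batch size
            # divided by profiled throughput (samples/s) at this frequency.
            t = steps_to_converge(bs) * bs / throughput_lut[(f, bs)]
            if t < best_time:
                best, best_time = (f, bs), t
    return best
```

In such a scheme, the profiling tables would be populated offline for each device, and the efficiency predictor would estimate how many optimization steps each batch size needs to reach the target accuracy; the search itself is then a cheap enumeration over the discrete configuration space.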
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=usT9zmom4T
Changes Since Last Submission: We thank the action editor and reviewers for their constructive feedback and valuable comments on our paper. We have refined the claims to make them more specific and included additional improvements based on the reviewers' suggestions, as recommended by the action editor (path 1). The major changes, highlighted in blue in the text, are summarized as follows:
1. We clarified and restructured Claim 1 to be more specific (i.e., "We motivate and demonstrate the importance of jointly adjusting ..."). The motivation and explanatory examples are presented in the Introduction and Problem Statement sections, and the demonstration is further supported by the results, particularly through the fastest-configuration baseline (and our proposed method), highlighting the impact of jointly selecting the batch size and frequency across the models, tasks, and devices used. We also emphasized this point in the description of the fastest-configuration baseline and made the claim more specific by stating the model architectures and task types for which this is shown. (as suggested by the Action Editor)
2. We revised Claim 3 to be specific to the models and datasets used (as suggested by the Action Editor). We also clarified that the stated improvement refers to the peak gain observed in our evaluation and included the average gains under the power-constraint scenarios ($P^1_{max}$ and $P^2_{max}$).
3. We merged Figures 5 and 6 into a single figure (Fig. 5) and revised the caption to be more detailed. The caption now clarifies that the values represent the mean percentage increase in training time (in Section 5.4, we added that the mean is computed over multiple runs, as discussed in the last paragraph of Section 5.1). We also indicated that entries with a value of 0\% refer to cases where our method selects the same configuration as the fastest-configuration baseline. Additionally, we added the tasks and the number of datasets to Claim 4 to make it more specific. (as suggested by the Action Editor)
4. We added a discussion of the periodic update of the LUTs in Section 4.1. (as suggested by reviewer tT5T)
5. We incorporated additional evaluation results for EfficientViT to further strengthen our evaluation. (as suggested by reviewers tT5T and tVDn)
6. We added an additional visualization of different configurations, power, and training time in Appendix A.4. (as suggested by reviewer tVDn)
Assigned Action Editor: ~Yoshitomo_Matsubara1
Submission Number: 5282