Abstract: Training only the last few layers of a deep neural network has been considered an effective strategy for efficient on-device training. Prior work has adopted this approach and focused on accelerating backpropagation. However, through a thorough system-wide analysis, we discover that when only the last few layers are trained, the primary bottleneck is actually the forward propagation through the frozen layers rather than backpropagation. To address this issue, we introduce the "cache and reuse" idea for on-device transfer learning and propose a two-stage training method: a cache initialization stage, in which we store the outputs of the frozen layers, followed by a training stage. To make our approach practical, we also propose augmented feature caching and cache compression to address the challenges of non-cacheable feature maps and cache size explosion. We carry out extensive experiments on various models (e.g., convolutional neural networks and vision transformers) using real edge devices to demonstrate the effectiveness of our method. For example, on an NVIDIA Jetson Orin NX with MobileNet-V2, our approach boosts training speed by 6.6× and improves accuracy by 2.1%; for EfficientNet-b0, it increases training speed by 2.2× and improves accuracy by 1.3%. Our approach therefore represents a significant step toward practical on-device transfer learning on resource-constrained edge devices.
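To make the two-stage structure concrete, the sketch below illustrates the basic "cache and reuse" idea under simplifying assumptions: the frozen layers are run once to populate a feature cache, and subsequent training epochs iterate over the cached features only. All module and variable names are illustrative, and the sketch omits the paper's augmented feature caching and cache compression.

```python
# Minimal sketch of cache-and-reuse transfer learning (hypothetical names;
# not the authors' implementation). Frozen backbone + small trainable head.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 10)
for p in backbone.parameters():       # freeze all backbone parameters
    p.requires_grad = False
backbone.eval()

# Toy data standing in for the transfer-learning dataset.
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
raw_loader = DataLoader(TensorDataset(images, labels), batch_size=64)

# Stage 1: cache initialization -- forward the frozen layers once, store outputs.
feats, ys = [], []
with torch.no_grad():
    for x, y in raw_loader:
        feats.append(backbone(x))
        ys.append(y)
cache = TensorDataset(torch.cat(feats), torch.cat(ys))

# Stage 2: training -- reuse cached features, skipping the (otherwise dominant)
# forward pass through the frozen layers in every epoch.
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):
    for f, y in DataLoader(cache, batch_size=64, shuffle=True):
        optimizer.zero_grad()
        loss_fn(head(f), y).backward()
        optimizer.step()
```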