Keywords: Efficient Training Method
Abstract: Traditional End-to-End deep learning models typically enhance feature representation capabilities by increasing network depth and complexity. While such an approach improves performance, it inevitably leads to issues such as parameter redundancy and inefficient resource utilization, which become increasingly pronounced as the network deepens. Existing methods have attempted to alleviate these problems by skipping or removing redundant layers. However, they often rely on complex manual designs, which may result in performance degradation, increased computational costs, and reduced memory efficiency.
To address these challenges, we propose a novel training paradigm termed Replacement Learning. This method selectively removes certain layers from the network and substitutes them with additional computing layers in an efficient and automated manner, thereby compensating for the potential performance loss caused by layer removal. Specifically, a computing layer is inserted between the neighboring layers of the removed layer; it utilizes parameters from the adjacent layers to construct a transformed parameter representation through a simple and efficient learnable block. This transformed representation is then used to perform additional computation on the output of the preceding layer, yielding the final output passed to the subsequent layer. Furthermore, to accommodate architectural variations such as feature map sizes and channel dimensions across different network types, we design a tailored, lightweight learnable block for each architecture. Replacement Learning leverages the contextual flow of information between adjacent layers to eliminate unnecessary computation, significantly reducing computational complexity, saving GPU memory, and accelerating training. More importantly, it achieves a balanced integration of historical context and newly introduced features, thereby enhancing overall model performance. We validate the effectiveness of Replacement Learning on five benchmarks—CIFAR-10, STL-10, SVHN, ImageNet, and COCO—across classification and detection tasks using both CNN and ViT architectures. Results demonstrate that our method not only significantly reduces the number of network parameters, shortens training time, and lowers memory consumption, but also surpasses traditional End-to-End trained models in performance.
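The sketch below illustrates the idea described in the abstract in PyTorch: a removed layer is replaced by a lightweight computing layer that derives a modulation from the neighboring layers' parameters and applies it to the preceding layer's output. This is a minimal illustration under assumed choices, not the authors' implementation: the per-channel weight statistics, the small MLP used as the learnable block, and the channel-wise scaling are all assumptions made for concreteness.

```python
# Minimal sketch of the replacement idea, assuming conv layers and a small
# MLP as the "learnable block". Not the paper's actual implementation.
import torch
import torch.nn as nn


class ComputingLayer(nn.Module):
    """Stands in for a removed layer between two retained conv layers."""

    def __init__(self, prev_layer: nn.Conv2d, next_layer: nn.Conv2d):
        super().__init__()
        self.prev_layer = prev_layer
        self.next_layer = next_layer
        self.channels = prev_layer.out_channels
        # Lightweight learnable block mapping pooled statistics of the
        # neighboring layers' weights to a per-channel modulation.
        self.learnable_block = nn.Sequential(
            nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summarize each neighboring layer's weights per channel of x
        # (mean absolute value is an arbitrary, cheap statistic).
        w_prev = self.prev_layer.weight.detach().abs().mean(dim=(1, 2, 3))  # (C,)
        w_next = self.next_layer.weight.detach().abs().mean(dim=(0, 2, 3))  # (C,)
        stats = torch.stack([w_prev, w_next], dim=1)          # (C, 2)
        scale = torch.sigmoid(self.learnable_block(stats))    # (C, 1)
        # Modulate the preceding layer's output channel-wise and pass it on.
        return x * scale.view(1, self.channels, 1, 1)


# Usage: a middle block is replaced by the computing layer.
prev = nn.Conv2d(3, 64, 3, padding=1)
nxt = nn.Conv2d(64, 128, 3, padding=1)
model = nn.Sequential(prev, nn.ReLU(), ComputingLayer(prev, nxt), nxt)
out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 128, 32, 32])
```

Because the learnable block only consumes summary statistics of the adjacent layers' weights, it adds far fewer trainable parameters than the layer it replaces, which is consistent with the memory and training-time savings claimed in the abstract.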
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5072