Keywords: Model parallelism, delayed gradient, reversible architecture, memory reduction
TL;DR: A new method combining delayed gradients and reversible architectures for scalable model parallelism.
Abstract: We propose PETRA (Parallel End-to-end Training of Reversible Architecture), a new training approach for reversible architectures that improves computational efficiency by enabling parallel gradient computation across layers. The method departs from conventional back-propagation by employing an approximate inversion of activations that effectively preserves gradient quality. By reducing the reliance on synchronous operations, it achieves a high degree of parallelization with only a slight increase in communication overhead. We evaluate PETRA on benchmark datasets including CIFAR-10, ImageNet-32, and ImageNet, across multiple reversible architectures, and find that it achieves competitive performance with minimal accuracy loss compared to traditional non-reversible training methods. Our method also offers a reduced memory footprint compared to delayed-gradient or checkpointing techniques, and, unlike pipelining strategies, it avoids bubble effects while remaining more parallelizable.
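To make the abstract's key ingredient concrete, the following is a minimal sketch of an additive-coupling reversible block whose inputs can be reconstructed from its outputs instead of being stored; when the block's weights have been updated between the forward and backward passes (as happens with delayed gradients), the reconstruction becomes the kind of approximate inversion the abstract refers to. The class and function names (ReversibleBlock, f, g) and the PyTorch framing are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """RevNet-style additive coupling: invertible by construction (illustrative, not PETRA's code)."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        # Additive coupling: the outputs determine the inputs, so x1, x2 need not be stored.
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Reconstruct the inputs from the outputs. If f and g have been updated since the
        # forward pass (delayed gradients), this reconstruction is approximate, not exact.
        with torch.no_grad():
            x2 = y2 - self.g(y1)
            x1 = y1 - self.f(x2)
        return x1, x2

# Usage sketch: run the block, discard its inputs, then recover them for the backward pass
# instead of keeping them in memory.
f = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
g = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
block = ReversibleBlock(f, g)
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
y1, y2 = block(x1, x2)
x1_rec, x2_rec = block.inverse(y1, y2)  # ~ (x1, x2), up to parameter staleness
```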
Supplementary Material: zip
Primary Area: Optimization for deep networks
Submission Number: 12601