Abstract: Training deep neural networks (DNNs) with half-precision floating-point formats is widely supported on recent hardware and frameworks. However, current training approaches that use half-precision formats neither attain optimal throughput, because single-precision computation remains involved, nor achieve state-of-the-art model accuracy, because of the reduced number of mantissa bits. In this work, we present a new DNN training engine, named TrainBF, which leverages the half-precision format BFloat16 to maximize training throughput while preserving sufficient model accuracy. TrainBF deploys BFloat16 across the entire training process for the best throughput and improves model accuracy by introducing three proposed normalization techniques. TrainBF remains lightweight by applying these normalization techniques only to the layers that are most critical to model accuracy. Furthermore, TrainBF implements a parallel strategy that executes DNN training operators concurrently, exploiting the spare memory saved by half-precision for better throughput. Evaluated on six common DNN models and compared with the state-of-the-art mixed-precision approach, TrainBF achieves competitive model accuracy with average throughput speedups of 1.21\(\times \), 1.74\(\times \), and 1.16\(\times \) on an NVIDIA A100 GPU, an AMD MI100 GPU, and the emerging AI accelerator SambaNova, respectively.
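For context, the sketch below illustrates what end-to-end BFloat16 training looks like in a mainstream framework. It is a minimal PyTorch example written under our own assumptions (placeholder model, synthetic data, a float32 loss computation as a safeguard), not TrainBF's implementation or its proposed normalization techniques.

```python
# Minimal sketch of end-to-end BFloat16 training in PyTorch.
# This only illustrates the general idea of keeping parameters and
# activations in BFloat16; it is NOT TrainBF's implementation, and the
# model, data, and hyperparameters below are placeholders.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Cast both parameters and activations to BFloat16.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.LayerNorm(1024),   # normalization layers help keep activations in range
    nn.ReLU(),
    nn.Linear(1024, 10),
).to(device=device, dtype=torch.bfloat16)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Synthetic BFloat16 inputs stand in for a real data loader.
    x = torch.randn(64, 512, device=device, dtype=torch.bfloat16)
    y = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad()
    # Casting logits to float32 for the loss is a common safeguard in this
    # sketch, not something prescribed by the paper.
    loss = loss_fn(model(x).float(), y)
    loss.backward()
    optimizer.step()
```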