Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures
Abstract: The growing popularity of Deep Neural Network (DNN) models has led to extreme-scale models whose depth and width continue to increase. However, their extremely high memory requirements make it difficult to train them on a single many-core architecture such as a Graphics Processing Unit (GPU), which compels researchers to adopt model parallelism over multiple GPUs. Model parallelism, however, introduces substantial additional overhead, so the ability to train an extreme-scale model on a single GPU is highly desirable. Several challenges remain in reducing the memory footprint of extreme-scale deep learning. To address this problem, we first identify the memory usage characteristics of deep and wide convolutional networks, and demonstrate opportunities for memory reuse at both the intra-layer and inter-layer levels. We then present Layrub, a runtime data placement strategy that orchestrates the execution of the training process. It achieves layer-centric reuse to reduce the memory consumption of extreme-scale deep learning models that could not previously be trained on a single GPU. Experiments show that, compared to the original Caffe, Layrub reduces memory usage by 58.2% on average and by up to 98.9%, at the moderate cost of 24.1% higher training execution time on average. Results also show that Layrub outperforms popular deep learning systems such as GeePS, vDNN, MXNet, and TensorFlow. More importantly, Layrub can tackle extreme-scale deep learning tasks: for example, it enables an extra-deep ResNet with 1,517 layers to be trained successfully on a single GPU with 12GB of memory, where other existing deep learning systems cannot.
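To illustrate the general idea of inter-layer memory reuse with host-side data migration (a minimal sketch only, not the authors' Layrub implementation; the buffer pool, layer count, and the NumPy stand-in for GPU memory are illustrative assumptions):

```python
# Minimal sketch (not the Layrub implementation) of inter-layer buffer reuse
# with host offloading: GPU-side activation buffers come from a small shared
# pool, the previous layer's activation is copied ("migrated") to host memory
# for later use in backward, and its buffer is returned to the pool so the
# next layer can reuse it. NumPy arrays stand in for GPU memory here.
import numpy as np

class BufferPool:
    """A tiny pool of equally sized 'GPU' buffers shared across layers."""
    def __init__(self, num_buffers, size):
        self.free = [np.empty(size, dtype=np.float32) for _ in range(num_buffers)]

    def acquire(self):
        return self.free.pop()        # reuse an idle buffer

    def release(self, buf):
        self.free.append(buf)         # make it available to later layers

def forward(num_layers=8, activation_size=1 << 20):
    pool = BufferPool(num_buffers=2, size=activation_size)  # constant device footprint
    host_copies = []                                         # activations offloaded to host
    prev = pool.acquire()
    prev[:] = 1.0                                            # input activations
    for layer in range(num_layers):
        out = pool.acquire()
        out[:] = prev * 0.5 + layer                          # stand-in for layer compute
        host_copies.append(prev.copy())                      # migrate to host for backward
        pool.release(prev)                                   # buffer reused by the next layer
        prev = out
    pool.release(prev)
    return host_copies

if __name__ == "__main__":
    saved = forward()
    print(f"offloaded {len(saved)} activations while holding only 2 device buffers")
```

In this sketch the device-side footprint stays constant at two buffers regardless of network depth, which conveys why layer-centric reuse can make very deep networks fit on a single GPU.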