Abstract: We present a new approach to scalable training of deep learning
machines by incremental block training with intra-block parallel optimization
to leverage data parallelism and blockwise model-update
filtering to stabilize the learning process. Using an implementation
on a distributed GPU cluster with an MPI-based HPC machine
learning framework to coordinate parallel job scheduling and collective
communication, we have successfully trained deep bidirectional
long short-term memory (LSTM) recurrent neural networks (RNNs)
and fully-connected feed-forward deep neural networks (DNNs) for
large vocabulary continuous speech recognition on two benchmark
tasks, namely the 309-hour Switchboard-I task and the 1,860-hour “Switchboard+Fisher”
task. We achieve almost linear speedup with up to 16 GPU
cards on the LSTM task and up to 64 GPU cards on the DNN task, with no
degradation in, or even improved, recognition accuracy compared with
traditional mini-batch stochastic gradient descent training on a
single GPU.
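To make the training scheme concrete, below is a minimal Python/NumPy sketch of one block-level update with blockwise model-update filtering, assuming simple model averaging for intra-block aggregation and a block-momentum-plus-block-learning-rate filter; the function and parameter names (bmuf_block_update, block_momentum, block_lr) are illustrative and not taken from the paper's MPI-based implementation.

    import numpy as np

    def bmuf_block_update(global_weights, worker_weights, prev_delta,
                          block_momentum=0.9, block_lr=1.0):
        # Intra-block parallel optimization: each worker has already run
        # mini-batch SGD on its own split of the current data block;
        # aggregate the resulting models by simple averaging (assumption).
        avg_weights = np.mean(worker_weights, axis=0)

        # Raw block-level model update relative to the previous global model.
        raw_update = avg_weights - global_weights

        # Blockwise model-update filtering: smooth the aggregated update with
        # a block-momentum term and a block learning rate before applying it,
        # which is intended to stabilize learning as workers are added.
        delta = block_momentum * prev_delta + block_lr * raw_update

        return global_weights + delta, delta

    # Hypothetical usage with 4 workers and a flattened parameter vector.
    rng = np.random.default_rng(0)
    w_global = rng.standard_normal(10)
    workers = [w_global - 0.01 * rng.standard_normal(10) for _ in range(4)]
    w_global, delta = bmuf_block_update(w_global, workers, np.zeros(10))

With block_momentum set to 0 and block_lr set to 1, this sketch reduces to plain per-block model averaging, so the filtering step can be read as adding inertia to the block-level updates.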
Recommender: Nikos Karampatziakis