In-Register Parameter Caching for Dynamic Neural Nets with Virtual Persistent Processor Specialization

Abstract: Dynamic neural networks enable higher representation flexibility compared to networks with a fixed architecture and are extensively deployed in problems dealing with varying input-induced network structure, such as those in Natural Language Processing. One of the standard optimizations used in static net training is persistency of recurrent weights on the chip. In dynamic nets, possibly-inhomogeneous computation graph for every input prevents caching recurrent weights in GPU registers. Therefore, existing solutions suffer from excessive recurring off-chip memory loads as well as compounded kernel launch overheads leading to underutilization of GPU SMs. In this paper, we present a software system that enables persistency of weight matrices during the training of dynamic neural networks on the GPU. Before the training begins, our approach named Virtual Persistent Processor Specialization (VPPS) specializes a forward-backward propagation kernel that contains in-register caching and operation routines. VPPS virtualizes persistent kernel CTAs as CISC-like vector processors that can be guided to execute supplied instructions. VPPS greatly reduces the overall amount of off-chip loads by caching weight matrices on the chip, while simultaneously, provides maximum portability as it does not make any assumptions about the shape of the given computation graphs hence fulfilling dynamic net requirements. We implemented our solution on DyNet and abstracted away its design complexities by providing simple function calls to the user. Our experiments on a Volta micro-architecture shows that, unlike the most competitive solutions, VPPS shows excellent performance even in small batch sizes and delivers up to 6x speedup on training dynamic nets.
0 Replies
Loading