TESLA: Task-wise Early Stopping and Loss Aggregation for Dynamic Neural Network Inference

15 Feb 2018 (modified: 15 Feb 2018) · ICLR 2018 Conference Blind Submission
Abstract: For inference in deep neural networks on end devices, it is desirable to deploy a single pre-trained model that can dynamically scale across a range of computational budgets without compromising accuracy. To achieve this goal, Incomplete Dot Product (IDP) has been proposed, which uses only a subset of the terms in each dot product during forward propagation. However, IDP has limitations, including noticeable performance degradation in operating regions with low computational costs and inherent performance limits because it relies on hand-crafted profile coefficients. In this paper, we extend IDP by proposing new training algorithms that involve a single profile, which may be trainable or pre-determined, to significantly improve overall performance, especially in operating regions with low computational costs. Specifically, we propose the Task-wise Early Stopping and Loss Aggregation (TESLA) algorithm, which, on a 3-layer multilayer perceptron trained on MNIST, outperforms the original IDP by 32\% when only 10\% of the dot-product terms are used and achieves 94.7\% accuracy on average. By introducing trainable profile coefficients, TESLA further improves the accuracy to 95.5\% without specifying the coefficients in advance. Moreover, applying TESLA to the VGG-16 model achieves 80\% accuracy on CIFAR-10 using only 20\% of the dot-product terms and maintains 60\% accuracy on CIFAR-100 using only 30\% of the dot-product terms, whereas the original IDP performs no better than random guessing on both datasets at such low computational costs. Finally, we visualize the learned representations at different dot-product percentages using class activation maps and show that, with TESLA, the learned representations adapt over a wide range of operating regions.
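The abstract describes Incomplete Dot Product (IDP) as keeping only a subset of the terms in each dot product, scaled by profile coefficients. The sketch below is a minimal NumPy illustration of that idea; the function name `incomplete_dot`, the `keep_ratio` parameter, and the linearly decaying profile are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def incomplete_dot(x, w, gamma, keep_ratio):
    """Dot product using only the first ceil(keep_ratio * n) terms,
    each scaled by a profile coefficient gamma[i].

    x, w       : 1-D arrays of equal length n
    gamma      : 1-D array of profile coefficients, length n
    keep_ratio : fraction of terms to keep, in (0, 1]
    """
    n = x.shape[0]
    k = max(1, int(np.ceil(keep_ratio * n)))  # number of terms actually used
    return np.sum(gamma[:k] * x[:k] * w[:k])

# Hypothetical example with a linearly decaying profile
# (an assumption for illustration only).
n = 8
x = np.random.randn(n)
w = np.random.randn(n)
gamma = np.linspace(1.0, 1.0 / n, n)

full    = incomplete_dot(x, w, gamma, keep_ratio=1.0)  # all terms
partial = incomplete_dot(x, w, gamma, keep_ratio=0.5)  # first 50% of terms
```

Varying `keep_ratio` at inference time is what lets a single trained model scale across computational budgets; TESLA's contribution, per the abstract, is training so that accuracy holds up even when the ratio is small.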