Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on the TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.
Submission: Do Deep Nets Really Need to be Deep? (Jimmy Ba, Workshop Track; submitted 25 Dec 2013; endorsed for oral presentation)
Reviews requested from Anonymous d691 and Anonymous a881 (14 Jan 2014, due 04 Feb 2014).

9 Comments

David Krueger 05 Jan 2014
Interesting paper. My comments:

Abstract: "Moreover, the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model." This does not appear to be true for the CNN model.

4. The last sentence of the first paragraph is missing a "to" at the end of the line: "models TO prevent overfitting".

6. The "It is challenging to..." sentence needs work.

7. "Insertion penalty" and "language model weighting" could use definitions or references. Also, "figure 1" should be "table 1".

7.1 The first claim (also made in the abstract) is not supported by the table for the SNN mimicking the CNN. It appears that ~15x as many parameters were needed to achieve the same level of performance; the last sentence of the first paragraph seems to acknowledge this. The second paragraph should, I think, be clarified: how are you increasing the performance of the deep networks, and what experiments did you perform that led to this conclusion?

8. The last sentence does not seem supported to me. Your results as presented only achieve the same level of performance as previous results, and in order to achieve this level of performance it would be necessary to use their training methods first so that your SNNs have something to mimic, correct?
Jimmy Ba 10 Jan 2014
David, thank you for your comments. We submitted a revised draft on Jan 3 that addressed some of your concerns. We're sorry you read the earlier, rougher draft.

You are correct that we are not able to train a shallow net to mimic the CNN model using a similar number of parameters as the CNN model, and the text has been edited to reflect this. We believe that if we had a large (>100M) unlabelled data set drawn from the same distribution as TIMIT, we would be able to train a shallow model with fewer than ~15x as many parameters to mimic the CNN with high fidelity, but we are unable to test that hypothesis on TIMIT and are now starting experiments on another problem where we will have access to virtually unlimited unlabelled data.

But we agree that the number of parameters in the shallow model will not be as small as the number of parameters in the CNN, because the weight sharing of the local receptive fields in the CNN allows it to accomplish more with a small number of weights than can be accomplished with one fully-connected hidden layer.

Note that the primary argument in the paper, that it is possible to train a shallow neural net (SNN) to be as accurate as a deeper, fully-connected feedforward net (DNN), does not depend on being able to train an SNN to mimic a CNN with the same number of parameters as the CNN. We view the fact that a large SNN can mimic the CNN without the benefit of the convolutional architecture as an interesting, but secondary, issue.

Thank you again for your comments. We agree with everything you said.
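To make the weight-sharing point concrete, here is a rough back-of-the-envelope comparison in Python. All shapes are hypothetical (this is not the actual TIMIT CNN): a convolutional layer reuses one small filter bank at every position, while a fully-connected layer producing an output of the same size pays for every input-output connection separately.

    # Rough parameter arithmetic for the weight-sharing argument (shapes are
    # illustrative, not the actual TIMIT CNN).
    in_h = in_w = 40                  # e.g., a 40 x 40 time-frequency patch
    n_filters, f = 64, 5              # 64 filters of size 5 x 5
    out_h = out_w = in_h - f + 1      # "valid" convolution output: 36 x 36

    conv_weights = n_filters * f * f                          # 1,600 shared weights
    fc_weights = (in_h * in_w) * (out_h * out_w * n_filters)  # 132,710,400 weights

    print(conv_weights, fc_weights)   # -> 1600 132710400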
Yoshua Bengio 07 Jan 2014
This paper asks interesting questions and has interesting experimental results. The generality of the results could be improved by considering more than one dataset, though. You might want to first fix a typo in Rich's name...

I concur with David Krueger regarding the somewhat misleading statements in the abstract, introduction, etc. regarding the matching of depth with width (and a LOT more training examples), which does not apply in the case of a convolutional net. This really needs to be fixed.

My take on the results is, however, quite different from the conclusions given in the paper. The paper makes it sound as if we could find a better way to train shallow nets in order to get results as good as deep nets, as if it were just an optimization issue. My interpretation is quite different. The results seem more consistent with the interpretation that the depth (and convolutions) provide a PRIOR that helps GENERALIZING better. This is consistent with the fact that a much wider network is necessary in the convolutional case, and that in both cases you need to complement the shallow net's training set with the fake/mimic examples (derived from observing the outputs of the deep net on unlabeled examples) in order to match the performance of a deep net.

I believe that my hypothesis could be disentangled from the one stated in the paper (which seems to say that it is a training or optimization issue) by looking at training error. According to my hypothesis, the shallow net's training error (without the added fake/mimic examples) should not be significantly worse than that of the deep net (at a comparable number of parameters). According to the 'training' hypothesis that the authors seem to state, one would expect training error to be measurably lower for deep nets. In fact, for other reasons I would expect the deep net's training error to be worse (this would be consistent with previous results, starting with my paper with Dumitru Erhan et al. in JMLR in 2010). It would be great to report those training errors. Note that to be fair, you have to report training error with no early stopping, continuing training for a fixed and large number of epochs (the same in both cases) with the best learning rate you could find (separately for each type of network).

Finally, the fact that even shallow nets (especially wide ones) can be hard to train (see Yann Dauphin's ICLR 2013 workshop-track paper) also weakens the hope that we could get around the difficulty of training deep nets by better training shallow nets.

Several more papers need to be cited and discussed. Besides my JMLR 2010 paper with Dumitru Erhan et al. (Why Does Unsupervised Pre-training Help Deep Learning), another good datapoint regarding the questions raised here is the paper on Understanding Deep Architectures using a Recursive Convolutional Network, by Eigen, Rolfe & LeCun, submitted to this ICLR 2014 conference. Whereas my JMLR paper is about understanding the advantages of depth as a regularizer, this more recent paper tries to tease apart various architectural factors (including depth) influencing performance, especially for convolutional nets.
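In code, the fair training-error comparison described above amounts to something like the following sketch (the function names are hypothetical placeholders for real training runs, not anyone's actual protocol):

    # Sketch of the proposed comparison: for each architecture, sweep the
    # learning rate, train each run for the same fixed (large) number of
    # epochs with no early stopping, and report the best final training error.
    def best_training_error(train_fn, learning_rates, n_epochs=500):
        # train_fn(lr, n_epochs) runs one training job and returns the final
        # training error; it stands in for a real training routine.
        return min(train_fn(lr, n_epochs) for lr in learning_rates)

    # Hypothetical usage, with one train_fn per architecture:
    # for name, fn in [("deep", train_deep), ("shallow", train_shallow)]:
    #     print(name, best_training_error(fn, [0.1, 0.01, 0.001]))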
Jimmy Ba 10 Jan 2014
Yoshua, thank you for your comments. We believe you may have read an older draft and hope that most or all of the misleading statements were corrected in the Jan 3 draft. Nonetheless, many of your comments still apply to the current paper.

We completely agree that generality would be improved with results on additional datasets. We submitted a workshop abstract instead of a full paper because we only had results for one data set, and we are about to run experiments on two other datasets.

With TIMIT we did not use more training data to train the shallow models than was used to train the deep models. We used exactly the same 1.1M training cases used to train the DNN and CNN models to train the SNN mimic model. The only difference is that the mimic SNN does not see the original labels. Instead, it sees the real-valued probabilities predicted by the DNN or CNN it is trying to mimic.

In general, model compression works best when a large unlabelled data set is available to be labeled by the "smart" model so that the smaller mimic model can be trained "hard" with less chance of overfitting. But for TIMIT unlabelled data was not available, so we used the same data used to train the deep models for compression (mimic) training. We believe that the fact that no extra data --- labeled or unlabelled --- was used to train the SNN models helps drive home the point that it may be possible to train shallow models to be as accurate as deep models.

We agree with your comment that "The paper makes it sound as if we could find a better way to train shallow nets in order to get results as good as deep nets, as if it was just an optimization issue", except that we view it perhaps more as an issue of regularization than of just optimization. In particular, we agree that depth, when combined with current learning and regularization methods such as dropout, is providing a prior that aids generalization, but we are not sure that a similar effect could not be achieved using a different learning algorithm and regularization scheme to train a shallow net on the original data.

In some sense we're making a black-box argument: we already have a procedure that, given a training set, yields a shallow net with accuracy comparable to a deep fully-connected feedforward net trained on the same data. If we hadn't shown you what the learning algorithm was in our black box, would you have been 100% sure that the wizard behind the curtain must have been deep learning? The real question is whether the black box *must* go through the intermediate step of training a deep model to mimic, or whether there exist other learning and regularization procedures that could achieve the same result without going through the deep intermediary. We do not (yet) know the answer to this question, but it is interesting that a shallow model can be trained to be as accurate as a deep model without access to any additional data.

We certainly agree that it is difficult to train large, shallow nets on the original targets with the learning procedures currently available. We agree that looking at training errors can be informative, but they might not resolve the issue in this case. If model compression has access to a very large unlabelled data set and the mimic model has sufficient capacity to represent the deep model, the shallow model will learn to be a high-fidelity mimic of the deep model and will make the same predictions, so the errors of the shallow mimic model and the deep model on train and test data will become identical as the error of the mimic's predictions relative to the deep model's is driven to zero. This is for the ideal case where we have access to a very large unlabelled data set, which unfortunately we did not have for TIMIT. Exactly what training errors do you want to see: the error of the DNN on the original training data vs. the error of the SNN trained to mimic the DNN on the real-valued targets but measured on the original labels of the training points, or vs. the error of an SNN trained on the original data and labels?

Early stopping was used when training the deep models, but was not used when training the mimic SNN models. In fact, we find it very difficult to make the SNN mimic model overfit when trained with L2 loss on continuous targets.

Thanks for the pointers to other papers we should have cited. We're happy to add them to the abstract. And thanks again for the careful read of our abstract. Sorry you had to struggle through the 1st draft.
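As a concrete illustration of the mimic training discussed in this thread, here is a minimal NumPy sketch. All sizes, the teacher stand-in, and the learning rate are made up; this is a sketch of the idea (student regresses on the teacher's real-valued outputs with an L2 loss), not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    D, H, C, N = 100, 512, 10, 2048   # input dim, hidden units, classes, examples
    X = rng.standard_normal((N, D))

    # Stand-in for the trained deep teacher: any fixed function of X would do.
    W_t = rng.standard_normal((D, C)) * 0.1
    teacher_logits = np.tanh(X @ W_t) @ rng.standard_normal((C, C))

    # Shallow student: one non-linear hidden layer, trained on the teacher's
    # real-valued outputs instead of the original hard labels.
    W1 = rng.standard_normal((D, H)) * 0.01
    W2 = rng.standard_normal((H, C)) * 0.01
    lr = 1e-3

    for epoch in range(50):
        h = np.tanh(X @ W1)            # hidden activations
        pred = h @ W2                  # student's predicted targets
        err = pred - teacher_logits    # L2 regression error on teacher outputs
        # Backprop for the squared-error loss 0.5 * ||pred - teacher_logits||^2.
        gW2 = h.T @ err / N
        gh = (err @ W2.T) * (1 - h ** 2)
        gW1 = X.T @ gh / N
        W1 -= lr * gW1
        W2 -= lr * gW2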
Anonymous a881 03 Feb 2014
An interesting workshop paper. For such a provocative title, more results are needed to support the conclusions. Part of the resurgent success of neural networks for acoustic modeling is due to making the networks "deeper" with many hidden layers (see F. Seide, G. Li, and D. Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," ICASSP 2011, which shows that shallow networks perform worse than deep ones for the same # of parameters). This paper provides a different data point, in which a shallow network trained using the authors' "MIMIC" technique performs as well as a deep network baseline on the TIMIT phone recognition task. The MIMIC technique involves using unsupervised soft labels from an ensemble of deep nets of unknown size and quality, including a linear layer of unknown size, and training on the un-normalized log probabilities rather than the softmax output. The impact of each of these aspects on its own is not investigated; perhaps a deep neural network would gain from some or all of these MIMIC training steps as well.
Jimmy Ba 18 Feb 2014
Thank you for the comments. We completely agree that more results are needed to support the conclusions, and this is why we submitted an extended abstract instead of a full paper. More experiments are underway, but we don't yet have final results to add to the abstract.

Preliminary results suggest that on TIMIT the MIMIC models are not as accurate as the teacher models mainly because we do not have enough unlabeled TIMIT data to capture the function of the teacher models, as opposed to because the MIMIC models have too little capacity or cannot learn a complex function in one layer. Preliminary results also suggest that: 1) the key to making the shallow MIMIC model more accurate is to train it to be more similar to the deep teacher net, and 2) the MIMIC model is better able to learn to mimic the teacher model when trained on the logits (the unnormalized log probabilities) than on the softmax outputs from the teacher net.

The only reason for including the linear layer between the input and the non-linear hidden layer is to make training of the shallow model faster, not to increase accuracy. Experiments suggest that for TIMIT there is little benefit from using more than 250 linear units.

We agree with papers such as Seide, Li, and Yu that shallow nets perform worse than deep nets given the same # of parameters when trained with the current training algorithms. It is possible that, as Yoshua Bengio suggests, deep models provide a better prior than shallow models for complex learning problems. It is also possible that other training algorithms and regularization methods would allow shallow models to work as well. Or it may be a mix of the two. We believe the question of whether models must be deep to achieve extra accuracy is as yet open, and our experiments on TIMIT provide one data point suggesting it *might* be possible to train shallow models that are as accurate as deeper models on these problems.

We have tried using some of the MIMIC techniques to improve the accuracy of deep models. With the MIMIC techniques we have been able to train deep models with fewer parameters that are as accurate as deep models with more parameters (i.e., reduce the number of weights and number of layers needed in the deep models), but we have not been able to achieve significant increases in accuracy for the deep models. If compression is done well, the mimic model will be as accurate as the teacher model, but usually not more accurate, because the MIMIC process tries to duplicate the function (I/O behavior) learned by the teacher model in the smaller student model.
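A small numerical illustration of the logit-vs-softmax point (the numbers are made up): the softmax squashes a confident teacher's outputs to nearly one-hot, while the logits preserve the relative scores on the non-winning classes that the mimic net is meant to learn.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())   # shift for numerical stability
        return e / e.sum()

    teacher_logits = np.array([10.0, 5.0, 1.0])   # hypothetical teacher scores
    print(softmax(teacher_logits))                 # ~[0.993, 0.007, 0.000]
    # An L2 loss on the softmax output sees almost no signal from classes 2
    # and 3, whereas an L2 loss on the logits still penalizes a student that
    # gets the 5.0 vs 1.0 ordering wrong.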
Anonymous d691 05 Feb 2014
The authors show that a shallow neural net trained to mimic a deep net (regular or convolutional) can achieve the same performance as the deeper, more complex models on the TIMIT speech recognition task. They conclude that current learning algorithms are a better fit for deeper architectures and that shallow models can benefit from improved optimization techniques. The experimental results also show that shallow models are able to represent the same function as DNNs/CNNs. To my knowledge, training an SNN to mimic a DNN/CNN through model compression has not been explored before, and the authors seem to be getting good results, at least on the simple TIMIT task. It remains to be seen if their technique scales up to large-vocabulary tasks such as Switchboard and Broadcast News transcription. This being said, a few critiques come to mind:

- The authors discuss factoring the weight matrix between input and hidden units and present it as a novel idea. They should be aware of the following papers: T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets," in Proc. ICASSP, May 2013; and Jian Xue, Jinyu Li, and Yifan Gong, "Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition," in Proc. Interspeech, 2013.
- It is unclear whether the SNN-MIMIC models from Table 1 use any factoring of the weight matrix. If yes, what is k?
- It is unclear what targets were used to train the SNN-MIMIC models: DNN or CNN? I assume CNN, but it would be good to specify.
- On page 2 the feature extraction for speech appears to be incomplete. Are the features logmel or MFCCs? In either case, the log operation appears to be missing.
- On page 2 you claim that Table 1 shows results for "ECNN", which is undefined.
Jimmy Ba 18 Feb 2014
The reviewer says: "They conclude that current learning algorithms are a better fit for deeper architectures and that shallow models can benefit from improved optimization techniques." We are not really sure of this, but it is a possibility, and we are trying to do the experiments necessary to answer this question.

Thanks for pointing us to related work on re-parameterizing the weight matrices. We added these to the extended abstract. What we propose is somewhat different from this prior work. Specifically, we apply weight factorization during training (as opposed to after training) to speed convergence of the mimic model --- the weights of the linear layer and the weights in the non-linear hidden layer are trained at the same time with backprop. The SNN-MIMIC models in Table 1 use 250 linear units in the first layer. We updated the paper to include this information.

On page 2, the features are logmel: Fourier-based filter banks with 40 coefficients distributed on a mel scale. We have modified the paper to clarify this.

The ECNN on page 2 is an ensemble of multiple CNNs. Both SNN-MIMIC models (8k and 400k) are trained to mimic the ECNN. We mimic an ensemble of CNNs because we don't have any unlabeled data for TIMIT and thus must use the modest-sized train set for compression. With only 1.1M points available for compression, we observe that the student MIMIC model is usually 2-3% less accurate than the teacher model. We also observe, however, that whenever we make the teacher model more accurate, the student MIMIC model gains a similar amount of accuracy as well (suggesting that the fixed gap between the deep teacher and shallow MIMIC models is due to a lack of unlabeled data, not limited representational power in the shallow models). Because our goal is to train a shallow model of high accuracy, we needed to use a teacher model of maximum accuracy to help overcome this gap between the teacher and the mimic net. If we had a large unlabeled data set for TIMIT this would not be necessary. The ensemble of CNNs is significantly more accurate than a single CNN, but we have not yet published that result. We modified the paper to make all of this clearer.
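For the factoring, the parameter arithmetic looks roughly like this (the input and hidden sizes below are illustrative, not the exact TIMIT configuration; only k = 250 comes from the thread above). As the reply notes, the stated motivation for the linear bottleneck was faster convergence rather than the parameter savings themselves.

    # Factoring the d x h input-to-hidden weight matrix W as U (d x k) times
    # V (k x h), with k = 250 linear units, saves parameters whenever
    # k < d*h / (d + h); both factors are trained jointly with backprop.
    d, h, k = 1845, 400_000, 250      # hypothetical input dim and hidden units

    unfactored = d * h                # direct weight matrix: 738,000,000
    factored = d * k + k * h          # linear bottleneck:    100,461,250

    print(unfactored, factored)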
Jost Tobias Springenberg 25 Feb 2014
Hey, cool paper! After reading through it carefully, however, I have one issue with it. The way you present your results in Table 1 seems a bit misleading to me. On first sight I presumed that the mimic network containing 12M parameters was trained to mimic the DNN of the same size, while the large network with 140M connections was trained to mimic the CNN with 13M parameters (as is somewhat suggested by your comparison, i.e., by them achieving similar performance). However, as you state in your paper, both networks are actually trained to mimic an ensemble of networks whose size and performance are unknown to the reader. In your response to the reviewers you mention that the mimic network always performs 2-3% worse than the ensemble. This, to me, suggests that the ensemble performs considerably better than the best CNN you trained.

Given that my interpretation is correct, the performance of the ensemble should be mentioned in the text, and it should be clarified in the table that the mimic networks are trained to mimic this ensemble. Furthermore, assuming a 2 percent gap between the ensemble and the mimic network, it is possible that training, e.g., a three-layer network containing the same number of parameters could shorten this gap. That is, one could imagine a deeper mimic network actually performing better than the shallow mimic network (as it is not nearly perfectly mimicking the ensemble). I think this should be tested and reported alongside your results (if I read your comments to the reviewers correctly, you have tried, and succeeded, to train deep networks with fewer parameters to mimic larger ones, strongly hinting that this might be a viable strategy for mimicking the ensemble).