submission by rolfe22@gmail.com • Discriminative Recurrent Sparse Auto-Encoders
Abstract: We present the discriminative recurrent sparse auto-encoder model, which consists of an encoder whose hidden layer is recurrent, and two linear decoders, one to reconstruct the input, and one to predict the output. The hidden layer is composed of rectified linear units (ReLU) and is subject to a sparsity penalty. The network is first trained in unsupervised mode to reconstruct the input, and subsequently trained discriminatively to also produce the desired output. The recurrent network is time-unfolded with a given number of iterations, and trained using back-propagation through time. In its time-unfolded form, the network can be seen as a very deep multi-layer network in which the weights are shared between the hidden layers. The depth allows the system to exhibit all the power of deep networks while substantially reducing the number of trainable parameters.
From an initially unstructured recurrent network, the hidden units of discriminative recurrent sparse auto-encoders naturally organize into a hierarchy of features. The system spontaneously learns categorical-units, whose activity builds up over time through interactions with part-units, which represent deformations of templates. Even using a small number of hidden units per layer, discriminative recurrent sparse auto-encoders that are pixel-permutation agnostic achieve excellent performance on MNIST.
Decision: Conference Oral (ICLR 2013)
submission review by Yann LeCun
Review: Minor side comment: IN GENERAL, having a cost term at each iteration (time step of the unfolded network) does not eliminate the vanishing gradient problem!!!
The short-term dependencies can now be learned through the gradient on the cost on the early iterations, but the long-term effects may still be improperly learned. Now it may be that one is lucky (and that could apply in your setting) and that the weights that are appropriate for going from the state at t to a small cost at t+delta with small delta are also appropriate for minimizing the longer term costs for large delta.
There are good examples of that in the literature. A toy example is the recurrent network that learns the parity of a sequence. Because of the recursive nature of the solution, if you do a very good job at predicting the parity for short sequences, there is a good chance that the solution will generalize properly to much longer sequences. Hence a curriculum that starts with short sequences and gradually extends to longer ones is able to solve the problem, where only training from long ones without intermediate targets at every time step completely fails.
submission review by Richard Socher
Review: Hi,
This looks a whole lot like the semi-supervised recursive autoencoder that we introduced at EMNLP 2011 [1] and the unfolding recursive autoencoder that we introduced at NIPS 2011.
These models also have a reconstruction + cross entropy error at every iteration and hence do not suffer from the vanishing gradient problem.
The main (only?) differences are the usage of a rectified linear unit instead of tanh and restricting yourself to a chain structure, which is just a special case of a tree structure.
[1] http://www.socher.org/index.php/Main/Semi-SupervisedRecursiveAutoencodersForPredictingSentimentDistributions
submission reply by Jason Rolfe
Reply: Thank you very much for your constructive comments.
There are indeed similarities between discriminative recurrent auto-encoders and the semi-supervised recursive autoencoders of Socher, Pennington, Huang, Ng, & Manning (2011a); we will add the appropriate citation to the paper. However, the networks of Socher et al. (2011a) are very similar to RAAMs (Pollack, 1990), but with a dynamic, greedy recombination structure and a discriminative loss function. As a result, they differ from DrSAE as outlined in our response to Jürgen Schmidhuber. Like the work of Socher et al. (2011a), DrSAE is based on a recursive autoencoder that receives input on each iteration, with the top layer subject to a discriminative loss. However, Socher et al. (2011a), like Pollack (1990), iteratively add new information on each iteration, and then reconstruct both the new information and the previous hidden state from the resulting hidden state (Socher, Huang, Pennington, Ng, & Manning, 2011 reconstructs the entire history of inputs). The discriminative loss function is also applied at every iteration. In contrast, the input to DrSAE is the same on each iteration, and only the reconstruction and classification based upon the final state are optimized. The entire recursive LISTA stack constitutes a single encoder, which is decoded in a single (linear) step. Whereas Socher et al. (2011a) perform discriminative compression of a variable-length, structured input using a zero-hidden-layer encoder, our goal is static autoencoding using a deep (recursive) encoder.
Moreover, the main contribution of our paper is the demonstration of a novel and interesting hidden representation (based upon prototypes and their deformations along the data manifold), along with a network that naturally learns this representation. The hierarchical refinement of categorical-units from part-units that we observe seems unlikely to evolve in the networks of Socher et al. (2011a), since the activity of the part-units cannot be maintained across iterations by continuous input. The KL-divergence used for discriminative training in Socher et al. (2011a) is only identical to the logistic loss if the target distributions have no uncertainty (i.e., they are one-hot). Our ongoing work suggests that this difference is likely to be important for the differentiation of categorical-units and part-units.
submission review by Andrew Maas
Review: Interesting work! The use of ReLU units in an RNN is something I haven't seen before. I'd be interested in some discussion of how ReLUs compare to e.g. tanh units in the recurrent setting. I imagine ReLU units may suffer less from vanishing/saturation during RNN training.
We have a related model (deep discriminative recurrent auto-encoders) for speech signal denoising, where the task is exactly denoising the input features instead of classification. It would be nice to better understand how the techniques you present can be applied in this type of regression setting as opposed to classification.
Andrew L. Maas, Quoc V. Le, Tyler M. O'Neil, Oriol Vinyals, Patrick Nguyen, and Andrew Y. Ng. (2012). Recurrent Neural Networks for Noise Reduction in Robust ASR. Interspeech 2012.
http://ai.stanford.edu/~amaas/papers/drnn_intrspch2012_final.pdf
submission review by anonymous reviewer dd6a • review of Discriminative Recurrent Sparse Auto-Encoders
Review: The paper describes the following variation of an autoencoder: An encoder (with ReLU nonlinearity) is iterated for 11 steps, with observations providing biases for the hiddens at each step. Afterwards, a decoder reconstructs the data from the last-step hiddens. In addition, a softmax computes class-labels from the last-step hiddens. The model is trained on labeled data using the sum of reconstruction and classification loss. To perform unsupervised pre-training, the classification loss can be ignored initially.
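For concreteness, here is a minimal sketch of the forward pass as summarized above (the variable names E, S, b, D, C, the sign of the bias, and the handling of the first step are our assumptions rather than the paper's exact notation):

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def drsae_forward(x, E, S, b, D, C, T=11):
    # The clamped input contributes a bias E @ x to the hidden units at every
    # step; the hidden state is updated through the recurrent matrix S with a
    # ReLU nonlinearity, and only the final state is decoded and classified.
    z = relu(E @ x - b)                  # first step: no recurrent input yet
    for _ in range(T - 1):
        z = relu(E @ x + S @ z - b)      # same input re-injected at each step
    x_hat = D @ z                        # linear reconstruction decoder
    logits = C @ z                       # linear classifier, fed to a softmax loss
    return z, x_hat, logits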
It is argued that training the architecture causes hiddens to differentiate into two kinds of unit (or maybe a continuum): part-units, which mainly try to perform reconstruction, and categorical units, which try to perform classification. Various plots are shown to support this claim empirically.
The idea is interesting and original. The work points towards a direction that hasn't been explored much, and that seems relevant in practice and from the point of view of how classification may happen in the brain. Some anecdotal evidence is provided to support the part-categorical separation claim. The evidence seems interesting, though I'm still pondering whether there may be other explanations for those plots. Training does seem to rely somewhat on finely tuned parameter settings like individual learning rates and weight bounds.
It would be nice to provide some theoretical arguments for why one should expect the separation to happen. A more systematic study would be nice, too, e.g. measuring how many recurrent iterations are actually required for the separation to happen. To what degree does that separation happen with only pre-training vs. with the classification loss? And in the presence of classification loss, could it happen with a shallow model, too? The writing and organization of the paper seem preliminary and could be improved. For example, it is annoying to jump back-and-forth to refer to plots, and some plots could be made more informative (see also comments below).
The paper seems to suggest that the model gradually transforms an input towards a class-template. I'm not sure I agree that this is the right view, given that the input is clamped (by providing biases via E), so it is available all the time. Any comments?
It may be good to refer to 'Learning continuous attractors in recurrent networks', Seung, NIPS 1998, which also describes a recurrent autoencoder (though that model is different in that it iterates encoder+decoder not just encoder with clamped data).
Questions/comments:
- It would be much better to show the top-10 part units and the top-10 categorical units instead of figure 2, which shows a bunch of filters for which it is not specified to what degree they're which (except for pointing out in the text that 3 of them seem to be more like categorical units).
- What happens if the magnitude of the rows of E is bounded simply by 1/T instead of 1.25/(T-1) ? (page 3 sentence above Eq. 4) Are learning and classification results sensitive to that value?
- Last paragraph of section 1: 'through which the prototypes of categorical-units can be reshaped into the current input': Don't you mean the other way around?
- Figure 4 seems to suggest that categorical units can have winner-takes-all dynamics that disfavor other categorical units from the same class. Doesn't that seem strange?
- Section 3.2 (middle) mentions why S-I is plotted, but S-I is shown and referred to earlier (section 3.1), and the explanation should instead go there.
- What about the 2-step model result with 400 hiddens (end of section 4)?
submission reply by Jason Tyler Rolfe, Yann LeCun
Reply: * Anonymous dd6a:
Thank you very much for your helpful comments.
P2: Both the categorical-units and the part-units participate in reconstruction. Since the categorical-units become more active than the part-units (as per figure 7), they actually make a larger contribution to the reconstruction (evident in figure 9(b,c), where even the first step of the progressive reconstruction is strong).
P4: The differentiation into part-units and categorical-units does occur even with only two ISTA iterations (one pass through the explaining-away matrix), the shallowest architecture in which categorical-units can aggregate over part-units, as noted at the end of section 4. Without the classification loss, the network is an instance of (non-negative) LISTA, and categorical-units do not develop at all. Thus, only one recurrent iteration is required for categorical-units to emerge, and the classification loss is essential for categorical-units to emerge. We have added plots to figure 3 demonstrating these phenomena.
With regards to the theoretical cause of the differentiation into categorical-units and part-units, please see part 1 of our response to Yoshua Bengio.
The three plots at the end were intended to serve as supplementary materials. However, as you point out, these figures are important for the analysis presented in the text, so they have been moved into the main text.
P5: The network decomposes the input into a prototype and a sparse set of perturbations; we refer to these perturbations, encoded in the part-units, as the signal that 'transforms' the prototype into the input. That is, categorical + part ~ input. The input itself is not (and need not be) modified in the process of constructing this decomposition. The clamping of the input does not affect this interpretation.
P6: Thank you for the reference; we have included it in the paper. Of course, since Seung (1998) does not include a discriminative loss function, there is no reason to believe that categorical-units differentiated from part-units in his model.
Q1: We have made the suggested change to figure 2. Filters sorted by categoricalness are also shown in figures 5, 6, 7, and 10.
Q2: We have not yet undertaken a rigorous or extensive search of hyperparameter space. We expect the results with the rows of E bounded by 1/T will be similar to those with a bound of 1.25/T. The (T-1) in the denominator of this bound in the paper was a typo, which we have corrected.
Q3: The assertion that 'the prototypes of categorical-units are reshaped into the current input' is mathematically equivalent to 'the current input is reshaped into the prototypes of categorical-units.' In one case, categorical + part = input; in the other, input - part = categorical. Both interpretations are actively enforced by the reconstruction component of the loss function L^U in equation 1. Since the inputs are clamped, we find it most intuitive to think of the reconstruction due to the prototypes of the categorical-units being reshaped by the part-units to match the fixed input.
Q4: When a chosen categorical-unit suppresses other categorical-units of the same class, it corresponds to the selection of a single prototype, which is both natural and desirable. It is easy to imagine that there may be classes with multiple prototypes, for which arbitrary linear combinations of the prototypes are not members of the class. For example, the sum of a left-leaning 1 and a right-leaning 1 is an X, rather than a 1.
Q5: Indeed, the ISTA-mediated relationship between S-I and D^t*D is first discussed in the second paragraph of section 3. This is the clearest explanation for the use of S-I. We have removed other potentially-confusing, secondary justifications, and further clarified the intuitive basis of this primary justification.
Q6: We have added the requested result on the 2-step model with 400 hiddens at the end of section 4. The trend is the same with 400 units as with 200 units. If the number of recurrent iterations is decreased from eleven to two, MNIST classification error in a network with 400 hidden units increases from 1.08% to 1.32%. With only 200 hidden units, MNIST classification error increases from 1.21% to 1.49%, although the hidden units still differentiate into part-units and categorical-units.
submission reply by Jason Tyler Rolfe, Yann LeCun
Reply: Q2: In response to your query, we have just completed a run with the encoder row magnitude bound set to 1/T, rather than 1.25/T. MNIST classification performance was 1.13%, rather than 1.08%. Although heuristic, the hyperparameters used in the paper were not the result of extensive hand-tuning.
submission review by anonymous reviewer bc93 • review of Discriminative Recurrent Sparse Auto-Encoders
Review: SUMMARY:
The authors describe a discriminative recurrent sparse auto-encoder, which is essentially a recurrent neural network with a fixed input and linear rectifier units. The auto-encoder is initially trained to reproduce digits of MNIST, while enforcing a sparse representation. In a later phase it is trained in a discriminative (supervised) fashion to perform classification.
The authors discuss their observations. Most prominently they describe the occurrence of two types of nodes: part-units and categorical units. The first are units that encode low-level features such as pen-strokes, whereas the second encode specific digits within the MNIST set. It is shown that before the discriminative training, the image reconstruction happens mostly by combining pen-strokes, whereas after the discriminative training, image reproduction happens mainly by the combination of a prototype digit of the corresponding class, which is subsequently transformed by adding pen-stroke-like features. The authors state that this observation is consistent with the underlying hypothesis of auto-encoders that the data lies on low-dimensional manifolds, and the auto-encoder learns to split the representation of a digit into a categorical prototype and a set of transformations.
GENERAL OPINION
The paper and the suggested network architecture are interesting and, as far as I know, quite original. It is also compelling to see the unique ways in which the unsupervised and supervised training contribute to the image reconstruction. Overall I believe this paper is a suitable contribution to this conference. I have some questions and remarks that I will list here.
QUESTIONS
- From figure 5 I get the impression that the state dynamics are convergent; for sufficiently large T, the internal state of the nodes (z) will no longer change. This begs the question: is the ideal situation the one where T goes to infinity? If so, could you consider the following scenario: We somehow compute the fixed, final state (maybe this can be performed faster than by simply iterating the system). Once we have it, we can perform backpropagation-through-time on a sequence in which the state at each time step is identical (the fixed-point state). This would be an interesting scenario, as you might be able to greatly accelerate the training process (all Jacobians are identical, error backpropagation has an analytical solution), and you would explicitly train the system to perform well on this fixed point, so transient effects are no longer important.
Perhaps I'm missing some crucial detail here, but it seems like an interesting scenario to discuss.
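For reference, the analytical solution alluded to here is the standard recurrent-backpropagation identity, stated under the assumptions that the iteration z^{t+1} = f(z^t, x; \theta) has converged to a fixed point z^* and that I - J is invertible:

$$
z^* = f(z^*, x; \theta), \qquad
\frac{\partial z^*}{\partial \theta} = (I - J)^{-1}\,\frac{\partial f}{\partial \theta}\Big|_{z^*}, \qquad
J = \frac{\partial f}{\partial z}\Big|_{z^*},
$$

where the total derivative sums the geometric series \sum_{k \ge 0} J^k, which converges when the spectral radius of J is below 1.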
- On a related note: what happens if - after training - the output (image reconstruction and classification) is constructed using the state from a later/earlier point in time? How would performance degrade as a function of time?
REMARKS
- In both the abstract and the introduction the following sentence appears: 'The depth implicit in the temporally-unrolled form allows the system to exhibit all the power of deep networks, while substantially reducing the number of trainable parameters'. I believe this is a dangerous statement, as tied weights will also impose a severe restriction on representational power (so they will not have 'all the power of deep networks'). I would agree with a rephrasing of this sentence that says something along the lines of: 'The depth implicit in the temporally-unrolled form allows the system to exhibit far more representational power, while keeping the number of trainable parameters fixed'.
- I agree with Yoshua's remark on the vanishing gradient problem. Tied weights cause every change in parameter space to be exponentially amplified/dampened (save for nonlinear effects), making convergence harder. The authors should probably rewrite this sentence.
- I deduce from the text that the system is only trained to provide output (image reconstruction and classification) at the T-th iteration. As such, the backpropagated error is only 'injected' at this point in time. This is distinctly different from the 'common' BPTT setup, where error is injected at each time step, and the authors should maybe explicitly mention this. Apparently reviewer 'Anonymous 8ddb' has interpreted the model as if it were to provide output at each time step ('the reconstruction cost found at each step which provide additional error signal'), so definitely make this more clear.
- The authors mention that they trained the DrSAE with T=11, so 11 iterations. I suspect this number emerges from a balance between computational cost and the need for a sufficient number of iterations? Please explicitly state this in your paper.
- As a general remark, the comparison to ISTA and LISTA is interesting, but the authors go to great lengths to find detailed analogies, which might not be that informative. I am not sure whether the other reviewers would agree with me, but maybe the distinction between categorical and part-units can be deduced without this complicated and not easy-to-understand analysis. It took me some time to figure out the content of paragraphs 3.1 and 3.2.
- I also agree with other reviewers that it is unfortunate that only MNIST has been considered. Results on more datasets, and especially other kinds of data (audio, symbolic?), might be quite informative.
submission reply by Jason Tyler Rolfe, Yann LeCun
Reply: * Anonymous bc93:
We offer our sincere thanks for your thoughtful comments.
Q1: The dynamics are indeed smooth, as shown in figure 5. However, there is no reason to believe that the dynamics will stabilize beyond the trained interval. In fact, simulations past the trained interval show that the most active categorical unit often seems to grow continuously.
Q2: The image reconstruction is small for the first iteration or two, but thereafter is stable throughout the trained interval and beyond. Classification is more sensitive to the exact balance between part-units and categorical-units, and is less reliable as one moves away from the trained iteration T.
R1: Any multilayer network (say with L layers of M units) can be seen as a recurrent network with M*L units, unrolled for L time steps, which is sparsely connected (e.g. with a block upper triangular matrix). Admittedly, this would be a computationally inefficient way to run the multilayer network. But the representational power of the two networks is identical. Hence recurrent nets are not intrinsically less powerful than multilayer ones, if one is willing to make them large. DrSAE leaves it up to the learning algorithm to decide which hidden units will act as 'lower-layer' or 'upper-layer' units.
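One way to write out this construction explicitly (an illustrative arrangement; up to a reordering of the blocks it is the block-triangular matrix mentioned above): stack the L layers' activations h_1, ..., h_L into one recurrent state z, put the feed-forward weight matrices on the block sub-diagonal, and feed the input only into the first block,

$$
z^{(t+1)} = \phi\!\left(W_{\mathrm{rec}}\, z^{(t)} + E\, x\right),
\qquad
W_{\mathrm{rec}} =
\begin{pmatrix}
0 & & & \\
W_2 & 0 & & \\
& \ddots & \ddots & \\
& & W_L & 0
\end{pmatrix},
\qquad
E =
\begin{pmatrix}
W_1 \\ 0 \\ \vdots \\ 0
\end{pmatrix}.
$$

Starting from z^{(0)} = 0 (and assuming \phi(0) = 0, as for ReLU or tanh), after L steps the last block of z^{(L)} equals \phi(W_L \phi(\cdots \phi(W_1 x))), the output of the L-layer feed-forward network.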
R2: The reference to the vanishing gradients problem was tangential and, given its contentious nature, has been removed from the paper. Nevertheless, please see our comments on the matter to the other reviewers.
R3: The loss functions are indeed only applied to the last iteration of the hidden units. We have added an explicit mention of this in the text to avoid confusion. Future work will explore the use of a reconstruction cost summed over time. This may have the effect of quickening the convergence of the inference and making the classification and reconstruction more stable past the training interval.
R4: The T=11 could more appropriately be called T'=10, since there are 10 applications of the explaining-away matrix S, although T=11 represents the number of applications of the non-linearity. Experiments were conducted for T=2, T=6, and T=11. The paper focuses mostly on T=11. We have added a note to this effect.
R5: While the existence of a dichotomy between part-units and categorical-units is certainly identifiable without recourse to ISTA, as is evident from figures 8 and 10, the understanding of the part-units is best framed in terms of ISTA, which predicts the learned parameters with considerable accuracy. Were it not for the fact that our network architecture is derived from ISTA, it would be remarkable that the part-units spontaneously learn parameters that so closely match those of ISTA.
While perhaps unfamiliar to some readers, ISTA is simple and intuitive; we suspect that the difficulty you allude to is primarily an issue of nomenclature. With non-negative units, ISTA is just projected gradient descent on the loss function of equation 1 (the projection is onto the non-negativity constraint). We have added a note to this effect in paragraph 3.1, which we hope will make this analysis easier to follow for readers unfamiliar with ISTA.
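As a rough illustration of that note (a sketch in our own notation, with a fixed step size eta and L1 weight alpha; this is not the paper's training code), non-negative ISTA is a gradient step on the reconstruction term followed by L1 shrinkage and projection onto the non-negative orthant, which together reduce to a shifted ReLU:

import numpy as np

def nonneg_ista(x, D, alpha, eta, T):
    # Projected gradient descent on 0.5*||x - D z||^2 + alpha*||z||_1 with z >= 0.
    z = np.zeros(D.shape[1])
    for _ in range(T):
        grad = D.T @ (D @ z - x)                       # gradient of the reconstruction term
        z = np.maximum(z - eta * (grad + alpha), 0.0)  # shrink by eta*alpha and project onto z >= 0
    return z

Rewriting the update as z <- max(0, (I - eta*D^T D) z + eta*D^T x - eta*alpha) exposes the encoder form, with the recurrent matrix playing the role of I - eta*D^T D; roughly, S - I ~ -eta*D^T D, which is the relationship between S-I and D^T D discussed in section 3.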
R6: Please see our response to the other reviewers.
submission reply by anonymous reviewer bc93
Reply: It's true that any deep NN can be represented by a large recurrent net, but that's not the point I was making. The sentence I commented on gives the impression that a recurrent network has the same representational power as any deep network 'while substantially reducing the number of trainable parameters'. If you construct an RNN the way you described in your answer to my remark, you don't reduce the number of trainable parameters at all.
Put differently, the impression that this particular sentence gives is that you can simply take a recurrent net, iterate it 5 times, and you would have the same representational power as any 5-layer deep NN (with the same number of nodes in each layer as the RNN), but with only one-fifth of the trainable parameters. This is, as I'm sure you'll agree, simply not true.
Remember, my remark is only concerned with the precise wording of the message you wish to convey. I do agree that iterating the network gives you more representational power for a fixed number of trainable parameters (that is more or less what you have shown in your paper), just not that it gives you as much representational power as in the case where the recurrent weights can be different each iteration (which is what happens in an equivalent deep NN).
submission review by Jason Rolfe
Review: * Jürgen Schmidhuber:
Thank you very much for your constructive comments.
1. Like the work of Pollack (1990), DrSAE is based on a recursive autoencoder that receives input on each iteration. However, (sequential) RAAMs iteratively add new information on each iteration, and then iteratively reconstruct the entire history of inputs from the resulting hidden state. In contrast, the input to DrSAE is the same on each iteration, and only the reconstruction based upon the final state is optimized. The entire recursive LISTA stack constitutes a single encoder, which is decoded in a single (linear) step. Whereas RAAMs perform unsupervised history compression, our goal is static autoencoding. Moreover, DrSAEs perform classification in addition to autoencoding; the logistic loss component is essential to the differentiation into categorical-units and part-units (RAAMs have no discriminative component). Finally, DrSAE's encoder is non-negative LISTA (a multi-layer network of rectified linear units, with tied parameters between the layers, and a projection from the input to all layers), its decoder is linear, and it makes use of a loss function including L1 regularization and logistic classification loss (RAAMs use a single-hidden-layer sigmoidal neural network without sparsification). RAAMs and DrSAEs are both recurrent and receive some sort of input on each iteration, but they have different architectures and solve different problems; they resemble each other only in the coarsest possible manner.
2. Please see point 2(b) in response to reviewer Anonymous 8ddb; the references to the vanishing gradient problem were tangential, and have been removed.
3. As you point out, it is well-known that data set augmentations (such as translations and elastic deformation of the input) and explicit regularization of the parameters to force the corresponding invariances (such as a convolutional network structure) improve the performance of machine learning algorithms of this type. It is similarly possible to improve performance by training many instances of the same network (perhaps on different subsets of the data) and aggregating their outputs. It is standard practice to separately report performance with and without making use of these techniques. Deformations can obviously be added in later to yield improved performance. We have added a note regarding the possibility of these augmentations, along with the appropriate citations.
submission review by Jason Rolfe
Review: We are very thankful to all the reviewers and commenters for their constructive comments.
* Anonymous 8ddb:
1. Indeed, the architecture of DrSAE is similar to a deep sparse rectifier neural network (Glorot, Bordes, and Bengio, 2011) with tied weights (Bengio, Boulanger-Lewandowski and Pascanu, 2012). In addition to the loss functions used, DrSAE differs from deep sparse rectifier neural networks with tied weights in that the input projects to all layers. We note this connection in the next-to-last paragraph of Section 1, and have added the reference to the citation you suggest.
It is true that part-units are strongly connected to the inputs while categorical-units are more strongly connected to part units than to the inputs. The categorical-units seem to act like units in the top layers of a multilayer network.
2(a). The input is indeed fed into all layers. We have added an explicit mention of this in the third paragraph of section 1, and in the first paragraph of section 2.
2(b). We removed the statement suggesting that DrSAE is less subject to the vanishing gradient problem in the introduction, because we have little hard evidence for it in the paper.
However, the intuition behind the statement is somewhat opposite to Yoshua Bengio's argument: the overall 'gain' of the recurrent encoder network (without input provided to each layer) must be around 1, simply because it is trained to reconstruct the input through a linear decoder whose columns have norm equal to 1. The unit activities can neither explode nor vanish over the recurrent steps because of that. Since the overall recurrent encoder has gain 1, each of the (identical) layers must have gain 1 too. Because of the reconstruction criterion, each recurrent step must also be approximately invertible (otherwise information would be lost, and reconstruction would be impossible). It is our intuition that in a sequence of invertible layers whose gain is 1, there are few vanishing gradient issues and little gradient 'diffusion' (the informal notion of gain can be made precise in terms of the eigenvalues of the Jacobian).
We do observe that as training of a DrSAE progresses, the magnitude of the gradient tends to equalize between all layers. But this will be the subject of future investigations.
3. The column-wise bounds on the norms of the matrices are enforced through projection on the unit sphere (i.e., column-wise scaling) after each SGD step. We have added explicit mention of this in footnote 2.
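A minimal sketch of this kind of post-step projection (illustrative naming; the sketch rescales only the columns that exceed the bound, which is one common reading of a norm bound enforced by column-wise scaling):

import numpy as np

def project_columns(W, max_norm=1.0):
    # Rescale any column of W whose L2 norm exceeds max_norm back onto the
    # sphere of radius max_norm; applied after each SGD step.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale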
5. Units still differentiate into part-units and categorical-units with only two temporal steps, but the prototypes are not as clean. We have added a mention of this to the end of section 4. Further investigation of the effect of the choice of the encoder on the differentiation into categorical-units and part-units will be the subject of future work.
* Testing on other datasets than MNIST (Anonymous 8ddb and Anonymous a32e):
Yes, results on other datasets like CIFAR would be ideal, but this will require a convolutional (or locally-connected) version of the method, since almost all architectures that yield good results on natural image datasets are of that type. We are currently working on a convolutional extension to DrSAE, which we are applying to classification of natural image datasets. But we believe that the architecture, algorithm, and results are interesting enough to be brought to the attention of the community before results on natural images become available.
That said, in preliminary testing using fully-connected DrSAE, we've obtained results superior to the deep sparse rectifier neural networks of Glorot, Bordes & Bengio (2011) on CIFAR-10; specifically, a 48.19% error rate using only 200 hidden units per layer, versus their error rate of 49.52% using 1000 hidden units per layer. Since Glorot et al. use a similar architecture (as discussed in point 1), this suggests that the differentiation into part-units and categorical-units improves classification performance on natural images.
* Anonymous a32e:
1. The architecture of the network is captured by equation 2 and figure 1. The loss function is specified in equations 1 and 4. The review of prior work and discussion of its relation to our network necessarily assumes familiarity with the prior work, since there is only space for a cursory summary of the published ideas upon which we draw. However, we would hope that the main analysis in the paper, in sections 3, 4, and 5, is understandable even without intimate familiarity with LISTA and the like.
2. The natural way to avoid manually chosen constants is to do an automatic search of hyperparameter space, maximizing the performance on a validation set. We hope to perform this search in the near future, as it will likely improve classification performance. As it stands, our ad-hoc parameters effectively offer a lower bound on the performance obtainable with a more rigorous search of hyperparameter space.
4. There are two kinds of 'fairness' in comparing results: 1. keep the computational complexity constant; 2. keep the number of parameters constant. The comparison between 2 and 11 time steps is intended to keep the number of parameters constant (though it does increase the computational complexity). It is unclear how one could hold both the number of parameters and the computational load constant within the DrSAE framework.
5. A more systematic exploration of encoder depth should certainly be undertaken as part of a complete search of the hyperparameter space.
* Yoshua Bengio:
1. We are presently exploring the cause of the differentiation into part-units and categorical units. In particular, we've now succeeded in inducing the differentiation using an unsupervised criterion derived from the discriminative loss of DrSAE. The interaction between our logistic loss function and the autoencoding framework thus seems to constitute the crucial ingredient beyond what is present in similar networks like your deep sparse rectifier neural networks. This work is ongoing, but we look forward to reporting this result soon. It would be interesting to explore the degree to which the rectified-linear activation function is necessary for the differentiation into part- and categorical-units. Our intuition, based upon experience with this unsupervised regularizer, as well as the fact that units differentiate even in a two-hidden-layer DrSAE, is that this activation function is not essential.
2. Please see point 2(b) in response to reviewer Anonymous 8ddb.
3 & 4. Thank you for the references. They have been included in the paper. We think it is worth noting, though, that dropout, tangent propagation, and iterative pretraining and stacking of networks (as in deep convex networks) are regularizations or augmentations of the training procedure that may be applicable to a wide class of network architectures, including DrSAE.
submission review by Jürgen Schmidhuber
Review: Interesting implementation and results.
But how is this approach related to the original, unmentioned work on Recurrent Auto-Encoders (RAAMs) by Pollack (1990) and colleagues? What's the main difference, if any? Similar for previous applications of RAAMs to unsupervised history compression, e.g., (Gisslen et al, AGI 2011).
The vanishing gradient problem was identified and precisely analyzed in 1991 by Hochreiter's thesis http://www.bioinf.jku.at/publications/older/3804.pdf . The present paper, however, instead refers to other authors who published three years later.
Authors write: 'MNIST classification error rate (%) for pixel-permutation-agnostic encoders' (best result: 1.08%). What exactly does that mean? Does it mean that one may not shift the input through eye movements, like in the real world? I think one should mention and discuss that without such somewhat artificial restrictions the best MNIST test error is at least 4 times smaller: 0.23% (Ciresan et al, CVPR 2012).
submission review by anonymous reviewer a32e • review of Discriminative Recurrent Sparse Auto-Encoders
Review: Authors propose an interesting idea to use deep neural networks with tied weights (recurrent architecture) for image classification. However, I am not familiar enough with the prior work to judge the novelty of the idea.
On a critical note, the paper is not easy to read without good knowledge of prior work, and is pretty long. I would recommend the authors consider the following to make their paper more accessible:
- the description should be shorter, simpler and self-contained
- try to avoid the ad-hoc constants everywhere
- run experiments on something larger and more difficult than MNIST - current experiments are not convincing to me; together with many hand-tuned constants, I would be afraid that this model might not work at all on more realistic tasks (or that a lot of additional manual work would be needed)
- when you claim that accuracy degrades from 1.21% to 1.49% if 2 instead of 11 time steps are used, you are comparing models with very different computational complexity: try to be more fair
- also, it would be interesting to show results for the larger model (400 neurons) with fewer time steps than 11
Still, I consider the main idea interesting, and I believe it would lead to interesting discussions at the conference.
submission review by anonymous reviewer 8ddb • review of Discriminative Recurrent Sparse Auto-Encoders
Review: Summary and general overview:
----------------------------------------------
The paper introduces Discriminative Recurrent Sparse Auto-Encoders, a new model, but more importantly a careful analysis of the behaviour of this model. It suggests that the hidden layers of the model learn to differentiate into a hierarchical structure, with part units at the bottom and categorical units on top.
Questions and Suggestions
----------------------------------------
1. Given equation (2), it seems that the model is very similar to a recurrent neural network with rectifier units, such as the one used e.g. in [1]. The main difference would be how the model is being trained (the pre-training stage as well as the additional costs and weight norm constraints). I think this observation could be very useful, and would provide a different way of understanding the proposed model. From this perspective, the differentiation would be that part units have weak recurrent connections and are determined mostly by the input (i.e. behave as MLP units would), while categorical units have strong recurrent connections. I'm not sure if this parallel would work or would be helpful, but I'm wondering if the authors explored this possibility or have any intuitions about it.
2(a). When mentioning that the model is similar to a deep model with tied weights, one should of course make it clear that, in addition to tied weights, you feed the (same) input at each layer. At least this is what equation (2) suggests. Is that the case? Or is the input fed only at the first step?
2(b). As Yoshua Bengio pointed out in his comment, I think recurrent networks, and hence DrSAE, suffer more from the vanishing gradient problem than deep forward models (contrary to the suggestion in the introduction). The reason is the lack of degrees of freedom RNNs have due to the tied weights used at each time step. If W for an RNN is moved such that its largest eigenvalue becomes small enough, the gradients have to vanish. For a feed-forward network, all the W_i of the different layers need to change to have this property, which seems a less likely event. IMHO, the reason why DrSAE seems not to suffer too much from the vanishing gradient is due to (a) the norm constraint, and (b) the reconstruction cost found at each step which provide additional error signal. One could also say that 11 steps might not be a high enough number for the vanishing gradient to make learning prohibitive.
3. Could the authors be more specific when they talk about bounding the column-wise norms of the matrices? Is this done through a soft constraint added to the cost? Is it done, e.g., by scaling D if the norm exceeds the chosen bound? Is there a projection done at each SGD step? It is not clear from the text how this works.
4. The authors might have expected this from reviewers, but anyway: could the authors validate this model (in a revision of the paper) on different datasets, besides MNIST? It would be useful to know that you see the same split of hidden units for more complex datasets (say CIFAR-10).
5. The model has also been run with only 2 temporal steps. Do you still get some kind of split between categorical and part hidden units? Did you attempt to see how the number of temporal steps affects this division of units?
References:
[1] Yoshua Bengio, Nicolas Boulanger-Lewandowski, Razvan Pascanu, Advances in Optimizing Recurrent Networks, arXiv:1212.0901
submission review by Yoshua Bengio
Review: Thank you for this interesting contribution. The differentiation of hidden units into class units and parts units is fascinating and connects with what I consider a central objective for deep learning, i.e., learning representations where the learned features disentangle the underlying factors of variation (as I have written many times in the past, e.g., Bengio, Courville & Vincent 2012). Why do you think this differentiation is happening? What are the crucial ingredients of your setup that are necessary to observe that effect?
I have a remark regarding this sentence on the first page: 'Recurrence opens the possibility of sharing parameters between successive layers of a deep network, potentially mitigating the vanishing gradient problem'. My intuition is that the vanishing/exploding gradient problem is actually *worse* with recurrent nets than with regular (unconstrained) deep nets. One way to visualize this is to think of (a) multiplying the same number with itself k times, vs (b) multiplying k random numbers. Clearly, (a) will explode or vanish faster because in (b) there will be some 'cancellations'. Recurrent nets correspond to (a) because the weights are the same at all time steps (but yes, the non-linearities' derivatives will be different), whereas unconstrained deep nets correspond to (b) because the weight matrices are different at each layer.
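A tiny numerical illustration of this intuition (purely illustrative; the distribution and scale of the log-gains are arbitrary choices of ours): with zero-mean log-gains of standard deviation sigma, reusing the same factor k times gives the log of the product a spread of k*sigma, whereas k independent factors give only sqrt(k)*sigma, so case (a) explodes or vanishes far more often.

import numpy as np

rng = np.random.default_rng(0)
k, trials = 50, 10000
log_gains = rng.normal(0.0, 0.1, size=(trials, k))   # zero-mean log-gains, sigma = 0.1

same = k * log_gains[:, 0]        # (a): log of a^k, one factor reused k times
mixed = log_gains.sum(axis=1)     # (b): log of a product of k independent factors

print(same.std(), mixed.std())    # ~ k*sigma = 5.0  vs  ~ sqrt(k)*sigma ~ 0.71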
Minor point about prior work: in the very old days I worked on using recurrent nets trained by BPTT to iteratively reconstruct missing inputs and produce discriminatively trained outputs. It worked quite well. NIPS'95, Recurrent Neural Networks for Missing or Asynchronous Data.
Regarding the results on MNIST, among the networks without convolution and transformations, one should add the Manifold Tangent Classifier (0.81% error), which uses unsupervised pre-training, the Maxout Networks with dropout (0.94%, no unsupervised pre-training), DBMs with dropout (0.79%, with unsupervised pre-training), and the deep convex networks (Yu & Deng, 0.83% also with unsupervised learning).