Recursive neural network models and their accompanying vector representations for words have seen success in an array of increasingly semantically sophisticated tasks, but almost nothing is known about their ability to accurately capture the aspects of linguistic meaning that are necessary for interpretation or reasoning. To evaluate this, I train a recursive model on a new corpus of constructed examples of logical reasoning in short sentences, like the inference of "some animal walks" from "some dog walks" or "some cat walks," given that dogs and cats are animals. The results are promising for the ability of these models to capture logical reasoning, but the model tested here appears to learn representations that are quite specific to the templatic structures of the problems seen in training, and that generalize beyond them only to a limited degree.

9 Comments

Sam Bowman 23 Dec 2013
Source code and data are available here: http://goo.gl/PSyF5u. I'll be updating the paper shortly to add a link to the text.
Anonymous 8e44 31 Jan 2014
In this work the author investigates how effective vector representations of words are for the task of logical inference. A set of seven entailment relations from MacCartney is used, and a data set of 12,000 logical statements (pairs of sentences) is generated from these relations and from 41 predicate tokens. The task is multiclass classification: given two sentences, the system must output the correct relation between them. A simple recursive tensor network is used. The study is limited to quantifiers like "some" and "all", which have clear monotonicity properties. Results show that the model can learn, but that generalization is limited. Unfortunately, because the training process converges to an inferior model, results are given after very early stopping, at a point where subsequent iterations can give widely different results.

This is an exciting direction for research and it's great to see it being tackled. Unfortunately, however, the paper is unclear in crucial places, and the training methodology is questionable. I would encourage the author to clarify the paper (especially for the likely non-linguist audience) and strengthen the training algorithm (in order to demonstrate usefully reproducible results). Even if the results remain negative, this would then still be of significant value to the community.

Specific comments:

Section 2
---------
Your example of "some dogs bark" seems confused. For both arguments of "some" (not just the first), the inference works if the argument is replaced by something more general. You write that 'some' is downward monotone in its second argument, but your examples show upward monotonicity in both. (Specifically, you write "The quantifier 'some' is upward monotone in its first argument because it permits substitution of more general terms, and downward monotone in its second argument because it permits the substitution of more specific terms." - but in the same paragraph you also write that "some" is upward monotone in both arguments.) Readers who are asked to expend mental energy on disentangling unnecessary confusions like this can quickly lose motivation.

Table 1: this table is central to your work, but it needs more explanation. What is calligraphic D? You seem to be using the "hat" operator in two different senses (column 2 versus column 3). What does "else" mean in column 3 - how exactly is independence defined? The whole paper rests on MacCartney's framework, so I think it's necessary to explain more about this scheme here. In particular, I do not understand your "no animals bark | some dogs bark" example (and I fear most other readers won't either).

Typo: much hold --> must hold

Section 3
---------
"Several pretraining regimes meant to initialize the word vectors to reflect the relations between them were tried, but none offered a measurable improvement to the learned model" - please say which were tried. In particular, did you try fixed, off-the-shelf vectors, for example from Socher's work, or vectors trained on a large unlabeled dataset using Mikolov's word2vec?

I counted 4,624 parameters in the composition layer, 13,005 in the comparison layer (with dimension 45), and 800 for the 50 (16-dimensional) word vectors, giving a total of 18,429 parameters. Your training set size is quite a bit smaller than this, and regularization can only help so much. I wonder if the limited results and the difficulty of training (converging, but to poor solutions) simply indicate the need for more training data.
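For concreteness, here is the arithmetic behind those counts. It assumes that each composition tensor slice is d x d and acts bilinearly on the two child vectors, and that the comparison layer has the same form with 45 output dimensions; this is my reading of the architecture, not the author's code:

    # Parameter-count arithmetic for the architecture as I understand it:
    # d = 16 dimensional word and phrase vectors, a comparison layer with
    # 45 output dimensions, and 50 vocabulary items.
    d, comparison_dim, vocab = 16, 45, 50

    composition = d * d * d + d * (2 * d) + d      # 4096 + 512 + 16 = 4624
    comparison = (comparison_dim * d * d           # 11520
                  + comparison_dim * (2 * d)       # 1440
                  + comparison_dim)                # 45  -> 13005
    word_vectors = vocab * d                       # 800

    print(composition, comparison, word_vectors,
          composition + comparison + word_vectors)  # 4624 13005 800 18429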
You can test the data-size hypothesis by generating a training curve - that is, plot performance on a validation set for various training set sizes (when trained to convergence, and using whatever regularization you settle on). If the curve is still steep when using all the training data, then more data will help. If it's flat, then the task may not be learnable with the model used. (A sketch of what I mean is at the end of this review.)

Section 4
---------
Your description of the "basic monotonicity" datasets was unclear to me. Do 1 and 2 refer to Table 2? If so, it's not clear how "In some of the datasets (as in 1), this alternation is in the first argument, in some the second argument (as in 2), and in some both."

Section 5
---------
It is not very surprising that the model can learn the ALL-SPLIT data, since the training data is so tightly constrained. SET-OUT is also very close to the training data. I found it unclear exactly how the data splits were done, and what was tested on.

Typo: "the it is"

"I choose one of three target datasets" - how did you choose the three? (From the 200?) "potentially other similar datasets..." is imprecise. How did you choose?

"The model did not converge well for any of these experiments: convergence can take hundreds or thousands of passes through the data, and the performance of the model at convergence on test data was generally worse than its performance during the first hundred or so iterations. To sidestep this problem somewhat, I report results here for the models learned after 64 passes through the data." I'm afraid that this greatly reduces the value of these results (they are close to being irreproducible). The training algorithm should at least converge, or be more reproducible than this. (If the test error is still fluctuating wildly at the stopping iteration, other small changes, e.g. in the data, may give completely different results.)

Section 6
---------
"Pessimistically... Optimistically..." - this is speculation (neither is supported by the experiments) and so I don't think it adds much value.
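The training-curve suggestion above in code form, roughly (train_data, validation_data, train_to_convergence, and accuracy are placeholders for whatever data handling, training, and evaluation code you already have; only the shape of the experiment matters):

    import matplotlib.pyplot as plt

    sizes = [1000, 2000, 4000, 8000, len(train_data)]    # nested training subsets
    scores = []
    for n in sizes:
        model = train_to_convergence(train_data[:n])     # placeholder: your training loop
        scores.append(accuracy(model, validation_data))  # placeholder: your evaluation

    plt.plot(sizes, scores, marker="o")
    plt.xlabel("training set size")
    plt.ylabel("validation accuracy")
    plt.show()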
Sam Bowman 04 Feb 2014
Thanks for your comments. I am updating the paper now with some clarifications and typo repairs, and I'm in the process of setting up a few follow-up experiments.

Section 2: Thanks for pointing out the unclear bits here, especially that "some dogs bark" example, which I seem to have broken during some hasty final revisions. I'll post an updated version with this fixed shortly. To clarify some details here:
- "Some" is, in fact, upward monotone in both arguments.
- D is the domain containing all possible objects of the type being compared.
- The "^" symbol in column three was typeset incorrectly, and is meant to represent logical AND.
(A compact restatement of the relation definitions is at the end of this comment.)

Section 3: I did not try initializing the vectors with those used in any previous experiments (e.g. Socher's or Mikolov's). While that kind of initialization sounds promising in general, I think that the unambiguous fragment of English that I use is so different from ordinary English usage that outside information from these sources is unlikely to be helpful to the task. The pretraining settings that I experimented with involved first training the model on some or all of the pairs of individual words from Appendix B, annotated with the relations between them. I'm certainly sensitive to the concern that the model might be overparameterized, and I will see about getting a training curve together in the next week or two.

Section 4: The numbers referenced in those subsections do refer to Table 2, but the "(as in 2)" reference is a mistake. Example 2 in Table 2 corresponds to "Monotonicity with quantifier substitution." Thanks for catching that, and expect a fix soon.

Section 5: I agree that the ALL-SPLIT result is unsurprising, though I think it is useful as a sanity check to ensure that the model structure is usable for the task, and that the model isn't dramatically *under*parameterized. The three target datasets were chosen by hand: the choice of a fairly small number was necessary due to resource constraints, but the choices were arbitrary. I chose to focus on quantifier substitution datasets so as to render the three settings (the last three columns of Table 4) most easily comparable across the three target datasets. The reference to "potentially other similar datasets" could have been better put, but it refers to the fact that in each of the three experimental settings reported in Table 4, different criteria are used to decide which datasets are held out in training, and all of these criteria involve how similar a given dataset is to the target dataset.

You raise an important point about reproducibility, and I would appreciate any suggestions about better ways to report results given the fluctuations during training. I may try to report statistics over the model's performance over a range of iterations, or over its performance at a given iteration across several random re-initializations, and I am experimenting further with different ways of encouraging the model to converge. I would like to suggest that even the current results show a broader reproducible pattern: high performance on SET-OUT and SUBCL.-OUT is possible but subject to instabilities in the training algorithm, whereas high performance on PAIR-OUT cannot be demonstrated with this model as configured.
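For reference, here is a compact restatement of the relation definitions behind Table 1, following MacCartney, with x and y the denotations being compared and D the domain; the "else" row for independence simply means that none of the other conditions hold:

    \begin{align*}
    x \equiv y      &\iff x = y                                                      && \text{(equivalence)} \\
    x \sqsubset y   &\iff x \subset y                                                && \text{(forward entailment)} \\
    x \sqsupset y   &\iff x \supset y                                                && \text{(reverse entailment)} \\
    x \wedge y      &\iff x \cap y = \emptyset \,\wedge\, x \cup y = \mathcal{D}     && \text{(negation)} \\
    x \mid y        &\iff x \cap y = \emptyset \,\wedge\, x \cup y \neq \mathcal{D}  && \text{(alternation)} \\
    x \smallsmile y &\iff x \cap y \neq \emptyset \,\wedge\, x \cup y = \mathcal{D}  && \text{(cover)} \\
    x \,\#\, y      &\iff \text{none of the above holds}                             && \text{(independence)}
    \end{align*}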
Anonymous 7747 06 Feb 2014
The paper tries to determine whether representations constructed with recursive embeddings can be used to support simple reasoning operations. The essential idea is to train an additional comparison layer that takes the representations of two sentences and produces an output that describes the relation between the two sentences (entailment, equivalence, etc.). This approach is in fact closely related to the "restricted entailment operator" suggested near the end of Bottou's white paper http://arxiv.org/pdf/1312.6192v3.pdf. Experiments are carried out using a vastly simplified language and Socher's supervised training technique.

According to the author, the results are a mixed bag. On the one hand, the system can learn to reason on sentences whose structure matches that of the training sentences. On the other hand, performance quickly degrades when using sentences whose structure did not appear in the training set.

My reading of these results is much more pessimistic. I find it completely unsurprising that the system can learn to "reason" on sentences with known structure. On the other hand, the inability of the system to reason on sentences with new structure indicates that the recursive embedding network did not perform as expected. The key feature of the recursive structure is to share weights across all applications of the grouping layer. This weight sharing was obviously insufficient to induce a bias that helps the system generalize to other structures. Whether this is a simple optimization issue or a more fundamental problem remains to be determined.

My understanding is that the author always trains the system using the correct parsing structure, in a manner similar to Socher's initial work (please confirm). It would be very interesting to investigate whether one obtains substantially different results if one trains the system using incorrect parsing structures (either random structures or a left-to-right structure). Worse results would indicate that the structure of the recursive embeddings matters. Similar results would confirm the findings reported in http://arxiv.org/abs/1301.2811 and strongly suggest that recursive embeddings do not live up to expectations. This would of course be a negative result, but negative results are sometimes more informative than mixed bags (and, in my opinion, well worth publishing).
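To be concrete about the two control conditions I have in mind, something along these lines would do (purely illustrative; the composition model would be unchanged, only the trees fed to it differ):

    import random

    def left_to_right(tokens):
        # Strictly left-branching structure: (((w1 w2) w3) w4).
        tree = tokens[0]
        for tok in tokens[1:]:
            tree = (tree, tok)
        return tree

    def random_structure(tokens, rng=random):
        # Merge a random adjacent pair at each step until one tree remains.
        nodes = list(tokens)
        while len(nodes) > 1:
            i = rng.randrange(len(nodes) - 1)
            nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]
        return nodes[0]

    print(left_to_right(["some", "dog", "walks"]))    # (('some', 'dog'), 'walks')
    print(random_structure(["all", "dogs", "bark"]))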
Sam Bowman 07 Feb 2014
I agree that the results so far are not as strongly positive or negative as would be ideal, and I hope to be able to report somewhat more conclusive results about the behavior of the optimization techniques that I use (see comments above), but I think that the results presented so far are informative about the ability of models like this to do RTE more generally. The SET-OUT results show that the model is able to learn to identify the difference between two unseen sentences and, if that difference has been seen before, return a consistent label that corresponds to it. Perhaps more important is the fact that the model shows 100% accuracy on unseen examples like "some dog bark [entails] some animal bark" (seen in ALL-SPLIT, for example), where lexical items differ between the two sides. Here the model is both learning to do this reasoning about differences and learning to use information about entailment between lexical items (animal > dog) in novel environments.

As you suggest, I do use correct hand-assigned parses in both training and testing. I agree that it would be interesting to see what effect using randomly assigned parses instead would have, and I may be able to get those numbers at least by the conference date. It does seem worth mentioning, though, that the sentences are mostly three or four words long, so I would expect the parse structure to be far less important in these experiments than in ones with longer sentences (and thus more deeply nested tree structures), since every word is already quite close to the top of the composition tree regardless of the structure used here.

Since you brought up the (important) Scheible and Schuetze paper, I should mention that the prior motivation for using high-quality parse structures for this task is considerably stronger than the motivation for using them in binary sentiment tasks like the one reported on in that paper. In binary sentiment labeling, the label is largely (but not entirely) dependent on the presence or absence of strongly sentiment-expressing words, and decent performance (~80%) can be achieved using simple regression models with bigram or even unigram features. I don't have exactly comparable numbers for the dataset that I present in this paper, but RTE/NLI does not lend itself to baselines of comparable quality built on simple features: my task is deliberately easier than the RTE challenge datasets, but the average tuned model submitted to the first RTE workshop in 2005 got less than 55% accuracy on *binary* entailment classification. There is some related discussion in the review thread for the Scheible and Schuetze paper: http://openreview.net/document/e2ffbffb-ba93-43d0-9102-f3e756e3f63c

Thanks for the Bottou comparison, by the way. This does seem to me to be an implementation of a slightly generalized version of his proposed restricted entailment operator, and I had not previously noticed that parallel.
Anonymous e76d 07 Feb 2014
This paper investigates the use of a recursive model for logical reasoning in short sentences. An important part of the paper is dedicated to the description of the task and of the way the author simplifies MacCartney's task to keep only entailment relations that are unambiguous. For the model, a simple recursive tensor network (from Socher's work) is used.

While the more general task defined by MacCartney is well described, the reduced task addressed in this paper is less clear. The motivation stands: it is a great idea to reduce the task to unambiguous cases, for which we can better interpret the experimental results. However, in the end, it is difficult to draw relevant conclusions from the experiments, and many technical details needed to make the results reproducible are missing. The author may have tried to soften the negative aspects of the results, but it would be much more interesting to describe the negative results clearly. My opinion is that this paper is not well suited for the conference track, and maybe it should be submitted to the workshop track.
Sam Bowman 07 Feb 2014
Thanks for your comment. I absolutely intend for this paper to describe a reproducible result, and I would hope that the citations and provided code clarify any details that were omitted from the text. I would appreciate it if you could let me know which details you found unclear.

If your concerns center on the random noise in the results and the issues related to early stopping, I do see that as a real issue, and I am working to find a way either to encourage the model to converge more reliably or at least to report statistics over its behavior across runs.

The paper does contain some negative results, as you suggest: the model was only successful at some parts of the task, and I would like to explore those results as fully as possible. Is there anything in particular about the reporting of these results that you think could be clearer or more thorough?
Sam Bowman 14 Feb 2014
I have a fairly major unexpected update to report. In attempting to respond to 8e44's concerns about using early stopping to get around convergence issues, I discovered a mistake in my implementation of AdaGrad. In short, fixing that mistake led to much more consistent convergence and better results, including strong performance on the PAIR-OUT test settings, suggesting that the model is much more capable of generalizing to unseen reasoning patterns than I had previously suggested.

I realize that this is somewhat late in the review process to make substantial changes, but a new version of the paper is pending on arXiv and should be live by Monday. The results table and the discussion section have been replaced. I will also be updating the source code linked to above and in the paper before Monday to reflect this bug fix, along with a couple of small improvements to the way that cost and test error are reported during training.

If you are interested in what went wrong: I accidentally set up SGD with AdaGrad in such a way that it reset the sum of squared gradients after every full pass through the data, i.e., every few hundred gradient updates. Since this sum is used to limit the size of the gradient updates, resetting it this often prevented the model from converging reliably, even though it did not hurt the accuracy of the gradients themselves or stop the model from converging occasionally.
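In code, the difference is roughly the following; this is an illustrative sketch rather than the actual training code, and grad_fn, the learning rate, and the other names here are placeholders:

    import numpy as np

    def adagrad_sgd(params, grad_fn, data, n_epochs=64, lr=0.2, eps=1e-6,
                    reset_each_epoch=False):
        # Plain SGD with AdaGrad step-size scaling. Setting reset_each_epoch=True
        # reproduces the bug described above: the accumulated sum of squared
        # gradients is cleared after every pass through the data, so the
        # per-parameter step sizes never shrink the way AdaGrad intends.
        sum_sq = np.zeros_like(params)
        for epoch in range(n_epochs):
            if reset_each_epoch:  # the buggy behavior
                sum_sq = np.zeros_like(params)
            for example in data:
                g = grad_fn(params, example)
                sum_sq += g ** 2
                params = params - lr * g / (np.sqrt(sum_sq) + eps)
        return params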
Sam Bowman 15 Feb 2014
While the arXiv paper is being held in the queue before publication, you can view the revised paper using this temporary link: http://www.stanford.edu/~sbowman/arxiv_submission.pdf