Abstract: In this paper, we study the empirical effects of two major approaches to augmenting Transformers with a recurrent mechanism: (1) incorporating a depth-wise recurrence similar to Universal Transformers, and (2) incorporating a chunk-wise temporal recurrence like the Temporal Latent Bottleneck. Furthermore, we propose and investigate novel ways to extend and combine the above methods; for example, we propose a global mean-based dynamic halting mechanism for Universal Transformers and an augmentation of the Temporal Latent Bottleneck with elements of the Universal Transformer. We compare the models and probe their inductive biases on several diagnostic tasks, such as Long Range Arena (LRA), flip-flop language modeling, ListOps, and Logical Inference. The code is released in the supplementary material.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: 1. Overhauled conclusion (section 8)
2. Task descriptions of logical inference (6.6) and LRA (6.5)
3. More discussion on logical inference (6.5)
4. Minor claim changes: removed mention of "inductive biases" and placed more emphasis on the empirical comparison of the two forms of recurrence.
5. Added more context about the ListOps example in the Introduction.
6. Fixed mathematical expressions and added related descriptions in section 3.
7. Updated section 4.1 with more details motivating dynamic halting.
8. Added motivation for the gating mechanism (4.3).
9. Removed a mistaken claim about the performance of Universal Transformer + xPos on ListOps (6.4).
10. Added mean±std results and additional discussion for FlipFlop (Table 3).
11. Elaborated on mean pooling (Eqn 14).
12. Other minor changes discussed in the rebuttal responses.
Most changes are temporarily highlighted in purple for the convenience of the reviewers.
Assigned Action Editor: ~Jasper_Snoek1
Submission Number: 3281