Discrete Transformer

25 Sept 2019 (modified: 05 May 2023) · ICLR 2020 Conference Withdrawn Submission · Readers: Everyone
TL;DR: A discrete transformer that uses hard attention to ensure that each step depends only on a fixed context.
Abstract: The transformer has become a central model for many NLP tasks, from translation to language modeling to representation learning. Its success demonstrates the effectiveness of stacked attention as a replacement for recurrence in many tasks. In theory, attention also offers insight into the model’s internal decisions; in practice, however, stacked attention quickly becomes nearly as fully connected as recurrent models. In this work, we propose an alternative transformer architecture, the discrete transformer, with the goal of better separating out internal model decisions. The model uses hard attention to ensure that each step depends only on a fixed context. Additionally, the model uses a separate “syntactic” controller to decouple network structure from decision making. Finally, we show that this approach can be further sparsified with direct regularization. Empirically, this approach maintains the same level of performance on several datasets while discretizing the model’s reasoning decisions over the data.
Keywords: transformer, natural language processing, attention
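The page does not include an implementation, but as a rough illustration of the hard-attention idea the abstract describes, the sketch below shows a single attention head in which each query position commits to exactly one key position via a straight-through argmax. All names (HardAttentionHead, d_head, etc.) are hypothetical, and the straight-through estimator is only one plausible way to train such a discrete choice; the paper's actual mechanism may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HardAttentionHead(nn.Module):
    """Hypothetical sketch: one attention head where each query position
    attends to exactly one key position (hard attention), so each output
    step depends on a single, discrete context element."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # (batch, seq, seq)
        soft = F.softmax(scores, dim=-1)
        # Hard selection: a one-hot distribution over the argmax key position.
        hard = F.one_hot(soft.argmax(dim=-1), num_classes=soft.size(-1)).to(soft.dtype)
        # Straight-through trick: the forward pass uses the hard one-hot,
        # while gradients flow through the soft distribution.
        attn = hard + soft - soft.detach()
        return attn @ v  # each output position is a single selected value vector
```

In a full model, several such heads would presumably be stacked per layer, and the separate “syntactic” controller mentioned in the abstract would govern how structure is chosen; neither of those components is sketched here.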