Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Initially introduced as a machine translation model, the Transformer architecture has since become the foundation of modern deep learning architectures, with applications in a wide range of fields, from computer vision to natural language processing. To tackle increasingly complex tasks, Transformer-based models are now stretched to enormous sizes, requiring ever larger training datasets and an unsustainable amount of compute resources. The ubiquity of the Transformer and of its core component, the attention mechanism, thus makes them prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representations of the traditional scaled dot-product attention, leading to a symmetric dot-product attention with pairwise coefficients. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, reduces the number of trainable parameters by 6%, and halves the number of training steps required before convergence.
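To illustrate the idea summarized in the abstract, below is a minimal PyTorch sketch of a single attention head in which queries and keys share one projection matrix and a learnable per-dimension coefficient vector supplies the pairwise weighting, so the resulting compatibility matrix is symmetric. The class name, the exact parameterization, and the dimensions (d_model, d_head) are illustrative assumptions made here, not the authors' released implementation; the paper itself defines the precise formulation.

```python
import math
import torch
import torch.nn as nn

class SymmetricPairwiseAttention(nn.Module):
    """Hypothetical sketch of symmetric dot-product attention with pairwise coefficients.

    Queries and keys share a single projection W, and a learnable vector c rescales the
    shared feature dimensions, giving the symmetric compatibility X W diag(c) W^T X^T.
    """

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.shared_qk = nn.Linear(d_model, d_head, bias=False)  # one projection replaces W_Q and W_K
        self.pairwise = nn.Parameter(torch.ones(d_head))          # per-dimension pairwise coefficients
        self.value = nn.Linear(d_model, d_head, bias=False)
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.shared_qk(x)                               # shared query/key representation
        scores = (h * self.pairwise) @ h.transpose(-2, -1)  # symmetric: scores[i, j] == scores[j, i]
        attn = torch.softmax(scores * self.scale, dim=-1)
        return attn @ self.value(x)                         # (batch, seq_len, d_head)

# Example usage (shapes are illustrative):
# layer = SymmetricPairwiseAttention(d_model=768, d_head=64)
# out = layer(torch.randn(2, 16, 768))   # -> (2, 16, 64)
```

Because the separate query and key projections collapse into one shared matrix, each head drops one d_model x d_head weight matrix; a parameter saving of this kind is what the abstract's reported reduction in trainable parameters refers to.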
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Preprint Status: There is no non-anonymous preprint and we do not intend to release one.
A1: yes
A1 Elaboration For Yes Or No: In the Limitations section.
A2: no
A2 Elaboration For Yes Or No: We worked on the efficiency of already established techniques; our contributions do not give those techniques new capabilities.
A3: yes
A3 Elaboration For Yes Or No: The Abstract and Introduction sections summarize our main contributions.
B: yes
B1: yes
B1 Elaboration For Yes Or No: We used data from OSCAR and GLUE, and source code from Hugging Face, for which we provide the necessary citations.
B2: yes
B2 Elaboration For Yes Or No: The assets we used are under the Creative Commons CC0 1.0 license or the open-source Apache License 2.0.
B3: yes
B3 Elaboration For Yes Or No: The data used is made available for researchers.
B4: yes
B4 Elaboration For Yes Or No: See the appendix on filtering the OSCAR data.
B5: yes
B5 Elaboration For Yes Or No: We describe the source, the quantity, and the kind of data used in our experiments. As our work focuses on unsupervised training, we did not investigate the data any further.
B6: yes
B6 Elaboration For Yes Or No: Information available in Section 3 (Experiments), sub-section Pre-Training Dataset.
C: yes
C1: yes
C1 Elaboration For Yes Or No: Information available in Section 3 (Experiments), sub-section Model Architectures.
C2: yes
C2 Elaboration For Yes Or No: Information available in Section 3 (Experiments), sub-sections Pre-Training Setup and Fine-Tuning Setup.
C3: yes
C3 Elaboration For Yes Or No: In Section 4 (Results), sub-section GLUE Benchmark Fine-Tuning, we provide aggregates (mean and standard deviation) for our benchmark evaluation.
C4: yes
C4 Elaboration For Yes Or No: In the Reproducibility Statement section.
D: no
D1: n/a
D2: n/a
D3: n/a
D4: n/a
D5: n/a
E: no
E1: n/a