Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Initially introduced as a machine translation model, the Transformer architecture has since become the foundation of modern deep learning architectures, with applications in a wide range of fields, from computer vision to natural language processing. To tackle increasingly complex tasks, Transformer-based models are now stretched to enormous sizes, requiring ever larger training datasets and an unsustainable amount of compute resources. The ubiquity of the Transformer and of its core component, the attention mechanism, thus makes them prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representations of the traditional scaled dot-product attention, leading to a symmetric dot-product attention with pairwise coefficients. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, reduces the number of trainable parameters by 6%, and halves the number of training steps required before convergence.
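To illustrate the idea summarized in the abstract, below is a minimal PyTorch sketch of a single attention head in which queries and keys share one projection matrix and a learnable per-dimension coefficient vector supplies the pairwise weighting, so the resulting compatibility matrix is symmetric. The class name, the exact parameterization, and the dimensions (d_model, d_head) are illustrative assumptions made here, not the authors' released implementation; the paper itself defines the precise formulation.

```python
import math
import torch
import torch.nn as nn

class SymmetricPairwiseAttention(nn.Module):
    """Hypothetical sketch of symmetric dot-product attention with pairwise coefficients.

    Queries and keys share a single projection W, and a learnable vector c rescales the
    shared feature dimensions, giving the symmetric compatibility X W diag(c) W^T X^T.
    """

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.shared_qk = nn.Linear(d_model, d_head, bias=False)  # one projection replaces W_Q and W_K
        self.pairwise = nn.Parameter(torch.ones(d_head))          # per-dimension pairwise coefficients
        self.value = nn.Linear(d_model, d_head, bias=False)
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.shared_qk(x)                               # shared query/key representation
        scores = (h * self.pairwise) @ h.transpose(-2, -1)  # symmetric: scores[i, j] == scores[j, i]
        attn = torch.softmax(scores * self.scale, dim=-1)
        return attn @ self.value(x)                         # (batch, seq_len, d_head)

# Example usage (shapes are illustrative):
# layer = SymmetricPairwiseAttention(d_model=768, d_head=64)
# out = layer(torch.randn(2, 16, 768))   # -> (2, 16, 64)
```

Because the separate query and key projections collapse into one shared matrix, each head drops one d_model x d_head weight matrix; a parameter saving of this kind is what the abstract's reported reduction in trainable parameters refers to.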
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Preprint Status: There is no non-anonymous preprint and we do not intend to release one.
A1: yes
A1 Elaboration For Yes Or No: In the Limitations section.
A2: no
A2 Elaboration For Yes Or No: We worked on the efficiency of already established techniques; our contributions do not give those techniques new capabilities.
A3: yes
A3 Elaboration For Yes Or No: The Abstract and Introduction sections summarize our main contributions.
B: yes
B1: yes
B1 Elaboration For Yes Or No: We used data from OSCAR and GLUE, and source code from Hugging Face, for which we provide the necessary citations.
B2: yes
B2 Elaboration For Yes Or No: The assets we used are under the Creative Commons CC0 1.0 license or the open-source Apache License 2.0.
B3: yes
B3 Elaboration For Yes Or No: The data used is made available for researchers.
B4: yes
B4 Elaboration For Yes Or No: See the appendix on filtering the OSCAR data.
B5: yes
B5 Elaboration For Yes Or No: We describe the source, the quantity, and the kind of data used in our experiments. As our work focuses on unsupervised training, we did not investigate the data any further.
B6: yes
B6 Elaboration For Yes Or No: Information available in Section 3 (Experiments), sub-section Pre-Training Dataset.
C: yes
C1: yes
C1 Elaboration For Yes Or No: Information available in Section 3 (Experiments), sub-section Model Architectures.
C2: yes
C2 Elaboration For Yes Or No: Information available in Section 3 (Experiments), sub-sections Pre-Training Setup and Fine-Tuning Setup.
C3: yes
C3 Elaboration For Yes Or No: In Section 4 (Results), sub-section GLUE Benchmark Fine-Tuning, we provide aggregates (mean and standard deviation) for our benchmark evaluation.
C4: yes
C4 Elaboration For Yes Or No: In the Reproducibility Statement section.
D: no
D1: n/a
D2: n/a
D3: n/a
D4: n/a
D5: n/a
E: no
E1: n/a