Abstract: Highlights
• Employs joint attention for the incorporation of BERT into NMT models.
• Makes use of the representations of BERT’s intermediate layers.
• Employs a three-phase optimization strategy to overcome catastrophic forgetting.
• Studies how the size of BERT impacts the performance of NMT models.