Fully Quantized Transformer for Improved Translation

Gabriele Prato; Ella Charlaix; Mehdi Rezagholizadeh

25 Sept 2019 (modified: 22 Jun 2025) | ICLR 2020 Conference Withdrawn Submission | Readers: Everyone

TL;DR: We fully quantize the Transformer to 8-bit and improve translation quality compared to the full-precision model.

Abstract: State-of-the-art neural machine translation methods employ massive numbers of parameters. Drastically reducing the computational cost of such methods without affecting performance has so far remained unsolved. In this work, we propose a quantization strategy tailored to the Transformer architecture. We evaluate our method on the WMT14 EN-FR and WMT14 EN-DE translation tasks and achieve state-of-the-art quantization results for the Transformer, obtaining no loss in BLEU scores compared to the non-quantized baseline. We further compress the Transformer by showing that, once the model is trained, a good portion of the nodes in the encoder can be removed without causing any loss in BLEU.
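
The abstract does not spell out the quantization scheme itself. As a rough illustration only of what mapping values to 8 bits involves, the sketch below assumes plain min/max uniform quantization, which is a common textbook scheme and not necessarily the authors' method. The function names `quantize` and `dequantize` and the sample `weights` are hypothetical.

```python
def quantize(values, num_bits=8):
    """Uniformly map floats to integer codes in [0, 2**num_bits - 1].

    Assumption: the range is taken from the min and max of `values`.
    This is a generic scheme chosen for illustration.
    """
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    # Guard against a constant input, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


# Hypothetical weights, used only to exercise the two functions.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest level bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Under this scheme each value costs 8 bits instead of 32, and the reconstruction error per value is at most half of `scale`.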

Keywords: Transformer, quantization, machine translation, compression, pruning

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/fully-quantized-transformer-for-improved/code)
