Abstract: Cues from multiple modalities have been successfully applied in several fields of natural language processing, including machine translation (MT). However, the application of multimodal cues in low-resource MT (LRMT) remains an open research problem. The main challenge of LRMT is the lack of abundant parallel data, which makes it difficult to build MT systems that produce reasonable output. Multimodal cues can provide additional context and information that help mitigate this challenge. To address it, we present a multimodal machine translation (MMT) dataset for a low-resource language pair, Manipuri–English, consisting of images, audio, and corresponding parallel text. The text is collected from news articles in local daily newspapers and subsequently translated into the target language by native-speaker translators. Audio recordings of the Manipuri text by native speakers are also produced for the experiments. The study further investigates whether correlated audio-visual cues enhance the performance of the machine translation system. Several experiments are conducted to systematically evaluate the effectiveness of utilizing multiple modalities. Using automatic metrics and human evaluation, we carry out a detailed analysis of MT systems trained with text-only and multimodal inputs. Experimental results attest that MT systems in low-resource settings can be significantly improved, by up to +2.7 BLEU, by incorporating correlated modalities. The human evaluation reveals that the type of correlated auxiliary modality affects the adequacy and fluency of the MMT systems. Our results emphasize the potential of cues from auxiliary modalities to enhance machine translation systems, particularly in situations with limited resources.