MEDLINE as a Parallel Corpus: a Survey to Gain Insight on French-, Spanish- and Portuguese-speaking Authors’ Abstract Writing Practice
Abstract: Background: Parallel corpora are used to train and evaluate machine translation systems. To alleviate the cost of producing parallel
resources for evaluation campaigns, existing corpora are leveraged. However, little information may be available about the methods
used for producing the corpus, including translation direction. Objective: To gain insight into the MEDLINE parallel corpus used in the
biomedical task at the Workshop on Machine Translation in 2019 (WMT 2019). Material and Methods: Contact information for the
authors of MEDLINE articles included in the English/Spanish (EN/ES), English/French (EN/FR), and English/Portuguese (EN/PT)
WMT 2019 test sets was obtained from PubMed and publisher websites. The authors were asked about their abstract writing practices
in a survey. Results: The response rate was above 20%. Authors reported that they are mainly native speakers of languages other than
English. Although manual translation, sometimes via professional translation services, was commonly used for abstract translation,
authors of articles in the EN/ES and EN/PT sets also relied on post-edited machine translation. Discussion: This study provides a
characterization of MEDLINE authors’ language skills and abstract writing practices. Conclusion: The information collected in this
study will be used to inform test set design for the next WMT biomedical task.