Abstract: Health literacy is crucial yet often hampered by complex medical terminology. Existing simplification approaches are limited by small, sentence-level, and monolingual datasets. To address this, we introduce MedSiML, a large-scale dataset designed to simplify and translate medical texts into the ten most spoken languages, improving global health literacy. MedSiML includes over 64,000 paragraphs from PubMed, Wikipedia, and Cochrane reviews, simplified into English, Mandarin, Spanish, Arabic, Hindi, Bengali, Portuguese, Russian, Japanese, and Punjabi, with an additional super-simplified English version for readers with learning disabilities. We detail MedSiML's creation, including data sourcing, cleaning, and annotation using the Gemini Flash-1.5 model. We fine-tuned the Text-To-Text Transfer Transformer (T5) base model on this paragraph-level, multilingual data, achieving significant improvements over previous state-of-the-art models: 10.61% in Recall-Oriented Understudy for Gisting Evaluation (ROUGE-1), 11.01% in the SARI (System output Against References and against the Input sentence) simplification score, and 49.1% in semantic similarity. Experimental results show that the Flesch-Kincaid (FK) and Automated Readability Index (ARI) readability scores improve by 0.38 and 1.06, respectively, with no significant change in the Bilingual Evaluation Understudy (BLEU) score.
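For context, here is a minimal sketch of how a T5-base fine-tuning step of the kind the abstract describes might look with the Hugging Face `transformers` library. The `simplify:` task prefix, the example paragraph pair, and the hyperparameters are illustrative assumptions, not the authors' actual configuration:

```python
# A minimal sketch of fine-tuning T5-base for paragraph-level simplification,
# in the spirit of the abstract. The "simplify:" task prefix, the example pair,
# and the hyperparameters are illustrative assumptions, not the paper's setup.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical paired example: complex source text -> simplified target.
source = "simplify: Myocardial infarction results from occlusion of a coronary artery."
target = "A heart attack happens when a blood vessel that feeds the heart gets blocked."

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=512).input_ids

# One gradient step; a real run would iterate over the full MedSiML training split.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.4f}")
```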
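Similarly, the FK and ARI readability gains reported above can be reproduced in principle with standard readability formulas; the sketch below uses the open-source `textstat` package, with invented sample texts standing in for the paper's actual evaluation data:

```python
# A hedged sketch of computing the two readability metrics the abstract
# reports (Flesch-Kincaid grade and ARI). The sample texts are invented;
# lower scores indicate easier-to-read text.
import textstat

original = ("Myocardial infarction results from acute occlusion of a "
            "coronary artery, causing ischemic necrosis of the myocardium.")
simplified = ("A heart attack happens when a blood vessel that feeds the "
              "heart gets blocked and part of the heart muscle dies.")

for label, text in [("original", original), ("simplified", simplified)]:
    fk = textstat.flesch_kincaid_grade(text)
    ari = textstat.automated_readability_index(text)
    print(f"{label}: FK grade {fk:.2f}, ARI {ari:.2f}")
```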
DOI: 10.1007/978-981-96-6606-5_12