Abstract: We investigate the use of denoising diffusion probabilistic models (DDPMs) with UNet backbones for multilingual character generation across diverse writing systems. Our study spans Devanagari, English, Arabic, and Mayan scripts, exploring both script-specific
and unified multi-script generation approaches. We systematically evaluate the effects of resolution scaling, training dataset size, and attention mechanisms on generation quality. Additionally, we incorporate conditional generation with T5 text embeddings to guide a unified model across scripts. Experiments show that diffusion models can faithfully learn individual script characters when trained in isolation. However, a unified model spanning multiple scripts faces significant challenges: cross-script generalization is sensitive to training
data diversity, the number of training epochs, and visual similarity among scripts. We find that increasing image resolution yields only
marginal quality gains unless accompanied by more training data. While T5-based conditioning improves control across scripts, overlapping visual features among characters from different scripts still cause confusion. Our findings highlight both the potential of diffusion models for multilingual symbol generation and the practical challenges of achieving robust unified generation across diverse scripts.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, diffusion models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7173