Track: Corpus Track
Keywords: multilingual NLP, optical character recognition, parallel corpus, text alignment, low-resource languages, document digitization, Romansh
TL;DR: We release a four-language parallel corpus from Swiss voting booklets and show that Gemini 2.5 Flash Lite performs best for OCR, while a hybrid Sentence-SwissBERT+Gemini pipeline performs best for alignment.
Abstract: Swiss federal voting booklets are a valuable resource for natural language processing because of their high editing standards and their coverage of the four national languages of Switzerland (German, French, Italian, and Romansh Grischun). In this paper, we present VotingBooklets, an automatically extracted and aligned dataset, as well as VotingBooklets-Diamond, a subset that was manually corrected and verified by multiple annotators. We use the latter to benchmark a range of open and closed AI systems on two interdependent tasks: optical character recognition (OCR) and cross-lingual text alignment. Gemini 2.5 Flash Lite achieves the best OCR performance across all conditions, while a hybrid alignment approach, which uses Sentence-SwissBERT for initial embedding-based alignment and Gemini for targeted post-hoc correction of low-confidence pairs, yields the most accurate alignments. Applying these systems to the full collection of Swiss federal voting booklets, we release a large-scale four-language parallel corpus as a resource for low-resource NLP, multilingual representation learning, and the computational study of Swiss political discourse.
Submission Number: 53