How Good is AI on Swiss Voting Booklets? A Multilingual OCR and Alignment Benchmark

Elina Stüssi; Jannis Vamvas

25 Mar 2026 (modified: 19 May 2026) | SwissText 2026 Conference Submission | CC BY 4.0

Track: Corpus Track

Keywords: multilingual NLP, optical character recognition, parallel corpus, text alignment, low-resource languages, document digitization, Romansh

TL;DR: We release a four-language parallel corpus from Swiss voting booklets and show that Gemini 2.5 Flash Lite and a hybrid Sentence-SwissBERT+Gemini pipeline are best for OCR and alignment respectively.

Abstract: Swiss federal voting booklets are an interesting resource for natural language processing due to their high editing standards and coverage of the four national languages of Switzerland (German, French, Italian, and Romansh Grischun). In this paper, we present VotingBooklets, an automatically extracted and aligned dataset, as well as VotingBooklets-Diamond, a subset that was manually corrected and verified by multiple annotators. We use the latter to benchmark a range of open and closed AI systems on two interdependent tasks: optical character recognition (OCR) and cross-lingual text alignment. Gemini 2.5 Flash Lite achieves the best OCR performance across all conditions, while a hybrid alignment approach using Sentence-SwissBERT for initial embedding-based alignment and Gemini for targeted post-hoc correction of low-confidence pairs yields the most accurate results. Applying these systems to the full collection of Swiss federal voting booklets, we release a large-scale four-language parallel corpus as a resource for low-resource NLP, multilingual representation learning, and the computational study of Swiss political discourse.

Submission Number: 53
