The Art of Inhibiting Patterns: Gaussian Process Optimization of Hyphenation

ACL ARR 2026 January Submission6264 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: hyphenation, inhibiting patterns, Gaussian process optimization, hyphenation patterns
Abstract: The quality of hyphenation matters: when rendering text on small mobile screens, when rendering HTML in any browser, and in the narrow columns of academic papers, such as this abstract. An automated, robust, efficient, and generalizable solution for all languages in use would have saved billions of hours of scrolling and millions of tons of paper. Half a century of attempts to solve the hyphenation problem has yielded usable solutions based on TeX hyphenation patterns, but with low coverage and with errors caused mainly by insufficient exception handling, compound words, and language change (including new words and transcriptions). The parameters of the pattern generation process are still set by trial and error. We have designed a new method that applies Gaussian process optimization to hyphenation pattern generation by the Patgen program. It optimizes the threshold settings for the coverage and inhibition levels of pattern generation. We used a collection of 18 diverse datasets comprising hyphenated word lists for 15 languages and generated nearly optimal hyphenation patterns with respect to precision, recall, and generalization (10-fold cross-validation) criteria. The $F_{1/7}$ score reaches 0.9992 for consistently marked word lists, such as Czech and Slovak. The newly designed optimization technique sets new quality standards for the pattern generation methodology, approaching an optimal solution. Its deployment in typesetting systems such as TeX and InDesign, in text processors, and in CSS hyphenation in web browsers would have a large impact.
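The $F_{1/7}$ score quoted in the abstract can be illustrated with a minimal sketch. It assumes the standard $F_\beta$ definition with $\beta = 1/7$, which weights precision far more heavily than recall; the abstract does not spell out the formula, so this reading is an assumption. The function names and the good/bad/missed counts are hypothetical illustrations, not the authors' code.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: weighted harmonic mean of precision and recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def score_hyphenation(good: int, bad: int, missed: int,
                      beta: float = 1 / 7) -> float:
    """Score a pattern set from counts of hyphenation points.

    good   = correctly predicted hyphenation points
    bad    = predicted points that are not in the reference word list
    missed = reference points that were not predicted
    (Hypothetical helper; beta = 1/7 is an assumed reading of F_{1/7}.)
    """
    predicted = good + bad
    actual = good + missed
    precision = good / predicted if predicted else 0.0
    recall = good / actual if actual else 0.0
    return f_beta(precision, recall, beta)


if __name__ == "__main__":
    # No wrong breaks but 10% missed ones still scores high, because a
    # small beta penalizes bad hyphens far more than missed ones.
    print(score_hyphenation(good=90, bad=0, missed=10))
```

Under this reading, a pattern set that never inserts a wrong hyphen but misses some valid ones scores much higher than one with the opposite error profile. That matches typesetting practice, where a bad break is more harmful than a missed one.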
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: optimization methods, generalization, Gaussian process
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: Czech (cs), Dutch (nl), German (de), Greek (el), Icelandic (is), Italian (it), Malay (ms), Polish (pl), Portuguese (pt), Russian (ru), Slovak (sk), Spanish (es), Thai (th), Turkish (tr), Ukrainian (uk)
Submission Number: 6264