Efficient Unicode-Compatible Grammar-Constrained Decoding via String Homomorphism

ACL ARR 2024 June Submission2583 Authors

15 Jun 2024 (modified: 22 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Grammar-constrained decoding (GCD) enforces formal grammar constraints on the outputs of large language models (LLMs), ensuring that generated text adheres to predefined structural rules. This makes it well suited to tasks that require precise output formats. Despite its broad applications, the theoretical foundations of GCD remain underexplored, particularly from the perspective of formal language theory. In this work, we formalize tokenization as an inverse homomorphism, which maps the original string language to a token language defined over the alphabet of token IDs. This characterization matters for the efficiency of GCD: it provides both a theoretical basis and an efficient construction method for the GCD algorithm. We further extend the framework to support Unicode characters, which are essential for multilingual NLP applications.
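As a rough illustration of the inverse-homomorphism view described in the abstract, the sketch below uses a toy setting that is not taken from the paper: a hypothetical four-token vocabulary (`VOCAB`) and a regular language, (ab)*, in place of a full grammar. Detokenization plays the role of the homomorphism h from token-ID sequences to strings, and the token language is the preimage of the string language under h. For a finite automaton, that preimage can be recognized by lifting each character-level transition to a token-level one, which is the standard closure construction and only an assumption about how a GCD implementation might proceed.

```python
from itertools import product

# Hypothetical toy vocabulary (an assumption): token ID -> decoded string.
VOCAB = {0: "a", 1: "b", 2: "ab", 3: "ba"}


def detokenize(token_ids):
    """Homomorphism h: map a token-ID sequence to the concatenated string."""
    return "".join(VOCAB[t] for t in token_ids)


# Character-level DFA for the string language (ab)*.
# State 0 is the start state and the only accepting state.
CHAR_DELTA = {(0, "a"): 1, (1, "b"): 0}
STATES = {0, 1}
START = 0
ACCEPTING = {0}


def run_chars(state, text):
    """Run the character DFA from `state` over `text`; None means rejected."""
    for ch in text:
        state = CHAR_DELTA.get((state, ch))
        if state is None:
            return None
    return state


def lift_to_tokens():
    """Build token-level transitions: delta'(q, t) = delta*(q, h(t))."""
    token_delta = {}
    for q, t in product(STATES, VOCAB):
        nxt = run_chars(q, VOCAB[t])
        if nxt is not None:
            token_delta[(q, t)] = nxt
    return token_delta


TOKEN_DELTA = lift_to_tokens()


def accepts_tokens(token_ids):
    """Decide membership in the token language h^{-1}((ab)*)."""
    state = START
    for t in token_ids:
        state = TOKEN_DELTA.get((state, t))
        if state is None:
            return False
    return state in ACCEPTING


def allowed_next_tokens(state):
    """Token IDs with a defined transition from `state` (a decoding mask)."""
    return sorted(t for (q, t) in TOKEN_DELTA if q == state)
```

In this toy setting, `accepts_tokens(ids)` agrees with checking `detokenize(ids)` against the character-level automaton, and `allowed_next_tokens` gives the set of token IDs that a constrained decoder could leave unmasked at each step. The token-level table is built once per vocabulary and grammar, so no string is re-scanned during decoding.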
Paper Type: Long
Research Area: Syntax: Tagging, Chunking and Parsing
Research Area Keywords: parsing algorithms (symbolic, theoretical results), parsing and related tasks, inference methods, code generation and understanding
Contribution Types: Theory
Languages Studied: Formal Language
Submission Number: 2583