Keywords: acronym matching, weak supervision, monotonic sequence alignment, beam search, low-resource NLP
Abstract: Aligning acronyms with their long forms is a critical but underexplored problem in entity resolution and text retrieval. A long form can be compressed into its acronym in many ways (e.g., initials, syllables, partial words), and character-level annotations are rarely available in real-world data. To address these challenges, we present a weakly supervised approach that formulates acronym-long-form mapping as a monotonic character subsequence alignment task. First, we generate weak alignment labels by combining positional weights with a beam-search decoder. Next, the weak labels are used to train a character-level sequence labeller that predicts, for each long-form character, the likelihood that it belongs to the acronym. At inference time, we perform a secondary beam search over the character-level scores to recover the most probable acronym-long-form mapping. In experiments on three datasets curated from publicly available sources, our approach outperforms heuristic baselines and achieves performance comparable to variants trained on weak labels generated by large language models (LLMs), while requiring substantially less compute. This underscores its efficacy in low-resource, real-world settings.
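The inference step the abstract describes (a beam search over per-character scores that recovers a monotonic acronym-to-long-form alignment) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `align_acronym` and the `char_scores` input (a hypothetical per-character probability from the sequence labeller) are assumptions.

```python
from math import log

def align_acronym(acronym, long_form, char_scores, beam_width=5):
    """Beam search for a monotonic alignment of acronym characters
    to long-form positions. `char_scores[i]` stands in for the model's
    probability that long_form[i] belongs to the acronym (assumed input)."""
    # Each beam entry: (cumulative log-score, last matched index, alignment so far).
    beams = [(0.0, -1, [])]
    for ch in acronym.lower():
        candidates = []
        for score, last, align in beams:
            # Monotonicity: only consider positions after the last match.
            for i in range(last + 1, len(long_form)):
                if long_form[i].lower() == ch:
                    p = max(char_scores[i], 1e-9)  # guard against log(0)
                    candidates.append((score + log(p), i, align + [i]))
        if not candidates:
            return None  # no monotonic alignment exists
        # Keep the top-scoring partial alignments.
        candidates.sort(key=lambda x: x[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][2]  # long-form positions of the best alignment

# Toy usage with uniform scores (illustrative only):
lf = "natural language processing"
print(align_acronym("NLP", lf, [0.5] * len(lf)))
```

In the paper's setting the scores would come from the trained character-level labeller, so ties like the two `l` candidates in "natural language" would be broken by the model rather than by position.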
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, data-efficient training
Contribution Types: NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 4489