Keywords: acronym matching, weak supervision, monotonic sequence alignment, beam search, low-resource NLP
Abstract: Aligning acronyms with their long forms is a critical but underexplored problem in entity resolution and text retrieval. A long form can be compressed into its acronym in many ways (e.g., initials, syllables, partial words), and character-level annotations are rarely available in real-world data. To address these challenges, we present a weakly supervised approach that formulates acronym-long-form mapping as a monotonic character subsequence alignment task. First, we generate weak alignment labels by combining positional weights with a beam-search decoder. Next, the weak labels are used to train a character-level sequence labeller that predicts, for each long-form character, the likelihood that it belongs to the acronym. At inference time, we perform a secondary beam search over the character-level scores to recover the most probable acronym-long-form mapping. In experiments on three datasets curated from publicly available sources, our approach outperforms heuristic baselines and achieves performance comparable to variants trained on weak labels generated by large language models (LLMs), while requiring substantially less compute. This underscores its efficacy in low-resource, real-world settings.
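The inference step the abstract describes (a beam search over per-character scores that recovers a monotonic acronym-to-long-form alignment) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `align_acronym` and the `char_scores` input (a hypothetical per-character probability from the sequence labeller) are assumptions.

```python
from math import log

def align_acronym(acronym, long_form, char_scores, beam_width=5):
    """Beam search for a monotonic alignment of acronym characters
    to long-form positions. `char_scores[i]` stands in for the model's
    probability that long_form[i] belongs to the acronym (assumed input)."""
    # Each beam entry: (cumulative log-score, last matched index, alignment so far).
    beams = [(0.0, -1, [])]
    for ch in acronym.lower():
        candidates = []
        for score, last, align in beams:
            # Monotonicity: only consider positions after the last match.
            for i in range(last + 1, len(long_form)):
                if long_form[i].lower() == ch:
                    p = max(char_scores[i], 1e-9)  # guard against log(0)
                    candidates.append((score + log(p), i, align + [i]))
        if not candidates:
            return None  # no monotonic alignment exists
        # Keep the top-scoring partial alignments.
        candidates.sort(key=lambda x: x[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][2]  # long-form positions of the best alignment

# Toy usage with uniform scores (illustrative only):
lf = "natural language processing"
print(align_acronym("NLP", lf, [0.5] * len(lf)))
```

In the paper's setting the scores would come from the trained character-level labeller, so ties like the two `l` candidates in "natural language" would be broken by the model rather than by position.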
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, data-efficient training
Contribution Types: NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 4489