Realistic Training Data Generation and Rule Enhanced Decoding in LLM for NameGuess

ACL ARR 2024 December Submission 527 Authors

14 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · Readers: Everyone · License: CC BY 4.0
Abstract: Column name expansion for tabular data, known as NameGuess, can benefit a wide range of table-centric tasks in natural language processing and database management. Recent work proposes solving this task by fine-tuning Large Language Models (LLMs) on synthetic, rule-generated training data conditioned on table context. While previous work has made significant strides, we identify two key limitations: the unrealistic nature of rule-generated abbreviations in the training data and the persistent divergence problem in LLM outputs. To address the first issue, we propose a novel approach that combines a subsequence abbreviation generator trained on human-annotated data with the introduction of non-subsequence abbreviations into the training set. To address the second issue, we propose a decoding system constrained by a robust automaton that encodes the basic rules of abbreviation expansion. We also extend the original English NameGuess test set to cover non-subsequence and PinYin scenarios. Experimental results show that properly tuned moderate-size (7/8B) LLMs equipped with the refined decoding system can surpass the few-shot performance of state-of-the-art LLMs such as the GPT-4 series, which are widely believed to have over 100B parameters. The code and data are provided in the supplementary material.
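To make the abstract's automaton-constrained decoding concrete, the following is a minimal sketch, not the authors' implementation: it encodes one basic expansion rule, that the abbreviation (e.g. "cust_nm") must appear as a character subsequence of the expansion ("customer_name"). All identifier names (EOS, step, allowed, greedy_expand) are illustrative assumptions. State i means the first i abbreviation characters have been matched; the end-of-sequence symbol is only legal in the accepting state, which blocks the divergence case where the model stops before covering the whole abbreviation.

EOS = "<eos>"

def step(abbrev: str, state: int, char: str) -> int:
    """Advance the subsequence automaton by one emitted character:
    move forward only when it matches the next unmatched abbreviation char."""
    if state < len(abbrev) and char.lower() == abbrev[state].lower():
        return state + 1
    return state

def allowed(abbrev: str, state: int, candidate: str) -> bool:
    """Hard constraint: EOS may only be emitted once the whole
    abbreviation has been consumed; any other token is permitted."""
    if candidate == EOS:
        return state == len(abbrev)
    return True

def greedy_expand(abbrev: str, ranked_tokens_per_step) -> str:
    """Greedy decoding with the automaton as a filter: at each step,
    take the highest-ranked token the automaton permits."""
    state, out = 0, []
    for ranked in ranked_tokens_per_step:
        token = next(t for t in ranked if allowed(abbrev, state, t))
        if token == EOS:
            break
        out.append(token)
        for ch in token:
            state = step(abbrev, state, ch)
    return "".join(out)

# Toy usage: each inner list stands in for an LM's ranked token
# proposals at one decoding step.
print(greedy_expand("cust_nm",
                    [["cust"], ["omer"], ["_"], ["<eos>", "na"], ["me"], ["<eos>"]]))
# -> "customer_name"; the premature EOS at step 4 is rejected
#    because "nm" is still unmatched.

A real system would apply this filter over the tokenizer's vocabulary inside beam search, and the paper's non-subsequence and PinYin cases would need a richer automaton than this subsequence-only sketch.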
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation, Language Modeling, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 527