Realistic Training Data Generation and Rule Enhanced Decoding in LLM for NameGuess

ACL ARR 2024 December Submission 527 Authors

14 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · Readers: Everyone · License: CC BY 4.0
Abstract: Column name expansion for tabular data, known as NameGuess, can benefit a wide range of table-centric tasks in natural language processing and database management. Recent work proposes solving this task by fine-tuning Large Language Models (LLMs) on synthetic, rule-generated training data conditioned on table context. While previous work has made significant strides, we identify two key limitations: the unrealistic nature of rule-generated abbreviations in the training data and the persistent divergence problem in LLM outputs. To address the first issue, we propose a novel approach that combines a subsequence abbreviation generator trained on human-annotated data with the introduction of non-subsequence abbreviations into the training set. To address the second issue, we propose a decoding system constrained by a robust automaton that encodes the basic rules of abbreviation expansion. We also extend the original English NameGuess test set to cover non-subsequence and PinYin scenarios. Experimental results show that properly tuned moderate-size (7/8B) LLMs equipped with the refined decoding system can surpass the few-shot performance of state-of-the-art LLMs such as the GPT-4 series, which are widely believed to have over 100B parameters. The code and data are provided in the supplementary material.
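To make the abstract's automaton-constrained decoding concrete, the following is a minimal sketch, not the authors' implementation: it encodes one basic expansion rule, that the abbreviation (e.g. "cust_nm") must appear as a character subsequence of the expansion ("customer_name"). All identifier names (EOS, step, allowed, greedy_expand) are illustrative assumptions. State i means the first i abbreviation characters have been matched; the end-of-sequence symbol is only legal in the accepting state, which blocks the divergence case where the model stops before covering the whole abbreviation.

EOS = "<eos>"

def step(abbrev: str, state: int, char: str) -> int:
    """Advance the subsequence automaton by one emitted character:
    move forward only when it matches the next unmatched abbreviation char."""
    if state < len(abbrev) and char.lower() == abbrev[state].lower():
        return state + 1
    return state

def allowed(abbrev: str, state: int, candidate: str) -> bool:
    """Hard constraint: EOS may only be emitted once the whole
    abbreviation has been consumed; any other token is permitted."""
    if candidate == EOS:
        return state == len(abbrev)
    return True

def greedy_expand(abbrev: str, ranked_tokens_per_step) -> str:
    """Greedy decoding with the automaton as a filter: at each step,
    take the highest-ranked token the automaton permits."""
    state, out = 0, []
    for ranked in ranked_tokens_per_step:
        token = next(t for t in ranked if allowed(abbrev, state, t))
        if token == EOS:
            break
        out.append(token)
        for ch in token:
            state = step(abbrev, state, ch)
    return "".join(out)

# Toy usage: each inner list stands in for an LM's ranked token
# proposals at one decoding step.
print(greedy_expand("cust_nm",
                    [["cust"], ["omer"], ["_"], ["<eos>", "na"], ["me"], ["<eos>"]]))
# -> "customer_name"; the premature EOS at step 4 is rejected
#    because "nm" is still unmatched.

A real system would apply this filter over the tokenizer's vocabulary inside beam search, and the paper's non-subsequence and PinYin cases would need a richer automaton than this subsequence-only sketch.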
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation, Language Modeling, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 527