Bridging the Gap: Aligning Language Model Generation with Structured Information Extraction via Controllable State Transition
Track: Web mining and content analysis
Keywords: Information Extraction, Large Language Model, Few-shot Learning, Structure Generation
Abstract: Large language models (LLMs) achieve superior performance on generative tasks. However, because of the natural gap between language model generation and structured information extraction along three dimensions (task type, output format, and modeling granularity), they often fall short at structured information extraction, a capability crucial for effective data utilization on the web. In this paper, we formulate the language model's generation process as a controllable state transition, aligning the generation and extraction processes to guarantee the integrity of the output structure and to fit the goals of the information extraction task. Furthermore, we propose a Structure2Text decider that helps the language model understand fine-grained extraction information: it converts the structured output into natural language and makes state decisions, thereby focusing on task-specific information kernels and alleviating hallucinations and incorrect content generation. We conduct extensive experiments and detailed analyses across a wide range of information extraction tasks. Our method not only achieves significant performance improvements but also guarantees the integrity of the output structure, making the extracted content easy to parse.
Submission Number: 1168