GUIDEX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

ACL ARR 2025 February Submission 4760 Authors

16 Feb 2025 (modified: 09 May 2025) | ACL ARR 2025 February Submission | CC BY 4.0
Abstract: Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models (LLMs) have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GuideX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning LLaMa 3.1 with GuideX sets a new state-of-the-art across seven zero-shot Named Entity Recognition (NER) benchmarks. Models trained with GuideX gain up to 10 F1 points over previous methods without human-labeled data, and nearly 4 F1 points when combined with it. Models trained on GuideX also demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets will be released upon acceptance.
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Information Extraction, Synthetic Data Generation, Large Language Models
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 4760