Exploring Synthetic Data Generation Techniques for Employment Type Classification in Job Advertisements

ACL ARR 2024 June Submission5206 Authors

16 Jun 2024 (modified: 17 Jul 2024) · License: CC BY 4.0
Abstract: The classification of employment types in online job advertisements (OJAs) is crucial for labor market analysis and recruitment. This study addresses the limitations of manual data annotation by leveraging synthetic data generation (SDG) techniques using large language models (LLMs). We evaluate four SDG methods—plain prompting, sampling, precise attributes, and adjective attributes—to generate synthetic job ads and assess their impact on classification model performance. Our analysis focuses on the balance between dataset size, data diversity, and label fit, and we explore the use of Natural Language Inference (NLI) filtering to enhance data quality. Results show that models trained on synthetic data can effectively classify real-world job ads, achieving competitive performance. However, we observed significant volatility in outcomes, which we could not fully explain. By making our code and data publicly available, we provide the research community with opportunities to further investigate SDG techniques. By publishing our best models, we offer researchers tools capable of achieving up to 96% F1 on a real-world dataset for classifying German OJAs by employment type.
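The NLI filtering step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `score_fn` argument stands in for a real NLI model (e.g., a zero-shot entailment classifier), and the hypothesis template, label names, and threshold are all illustrative assumptions.

```python
def nli_filter(ads, label, score_fn, threshold=0.8):
    """Keep only synthetic ads whose text entails the hypothesis
    derived from the intended employment-type label.

    ads:       list of generated job-ad texts (premises)
    label:     intended employment-type label (e.g., "full-time")
    score_fn:  callable (premise, hypothesis) -> entailment score in [0, 1];
               a placeholder for an actual NLI model
    threshold: minimum entailment score to keep an ad (illustrative value)
    """
    # Hypothetical hypothesis template; the paper's exact wording may differ.
    hypothesis = f"This job advertisement offers {label} employment."
    return [ad for ad in ads if score_fn(ad, hypothesis) >= threshold]


# Toy usage with a stub scorer (a real setup would call an NLI model):
def toy_score(premise, hypothesis):
    return 0.9 if "Vollzeit" in premise else 0.1

ads = ["Wir suchen eine Fachkraft in Vollzeit.", "Praktikum im Marketing."]
kept = nli_filter(ads, "full-time", toy_score)
```

The idea is that ads whose generated text does not actually entail the label they were generated for are discarded, trading dataset size for better label fit.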
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: human evaluation, automatic evaluation, NLP tools for social analysis, fine-tuning, data augmentation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: German
Submission Number: 5206