JobSet: Synthetic Job Advertisements Dataset for Labour Market Intelligence

Samuele Colombo, Simone D'Amico, Lorenzo Malandri, Fabio Mercorio, Andrea Seveso

Published: 31 Mar 2025, Last Modified: 18 Nov 2025CrossrefEveryoneRevisionsCC BY-SA 4.0

Abstract: The use of online services for advertising job positions has grown in the last decade, thanks to the ability of Online Job Advertisements (OJAs) to observe the labour market in near real-time, predict new occupation trends, identify relevant skills, and support policy and decision-making activities. Unsurprisingly, 2023 was declared the Year of Skills by the EU, as skill mismatch is a key challenge for European economies. In such a scenario, machine learning-based approaches have played a key role in classifying job ads and extracting skills according to well-established taxonomies. However, the effectiveness of ML depends on access to annotated job advertisement datasets, which are often limited and require time-consuming manual annotation. The lack of OJA annotated benchmarks representative of the real online OJA and skills distributions is currently limiting advances in skill intelligence. To deal with this, we propose JobGen, which leverages Large Language Models (LLMs) to generate synthetic OJAs. We use real OJAs collected from an EU project and the ESCO taxonomy to represent job market distributions accurately. JobGen enhances data diversity and semantic alignment, addressing common issues in synthetic data generation. The resulting dataset, JobSet, provides a valuable resource for tasks like skill extraction and job matching and is openly available to the community1.

External IDs:doi:10.1145/3672608.3707718