Improving Soft Skill Extraction via Data Augmentation and Embedding Manipulation

Muhammad Uzair-Ul-Haq, Paolo Frazzetto, Alessandro Sperduti, Giovanni Da San Martino

Published: 2024, Last Modified: 02 Oct 2025SAC 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Soft skills (SS) are important for Human Resource Management when recruiting suitable candidates for a job. Nowadays, enterprises aim to automatically extract such information from documents, curriculum vitae (CVs) and job descriptions, to speed up their recruitment process. State-of-the-art Large Language Models (LLMs) have been successful in Natural Language Processing (NLP) by fine-tuning them to the domain-specific task. However, annotated data for the task is very limited and costly to obtain, since it requires domain experts. Moreover, SS consists of complex long entities which are difficult to extract given few annotated examples. As a consequence, the performance of the LLMs on soft skill detection still needs improvement before being used in a real-world context. In this paper, we introduce data augmentation based entity extraction approach which shows promising performance when the entity length is long (i.e more than three tokens). Moreover, we explore the performance of pre-trained LLMs to generate synthetic data for training. The pre-trained models are used to generate contextual augmentation of the baseline dataset. We further analyse the embeddings generated by these models in aiding the extraction process of entities. We develop an Embedding Manipulation (EM) approach to further improve the performance of baseline models. We evaluated our approach on the only publicly available dataset for soft skills (SKILLSPAN), and on three Entity Extraction datasets (GUM, WNUT-2017 and CoNLL-2003) to assess the proposed approach. Empirical evidence shows that the proposed approach allows us to get 6.52% increased F1 over the baseline model for the soft skills.

External IDs:dblp:conf/sac/Uzair-Ul-HaqFSM24