Abstract: We evaluate the effectiveness of synthetic data fine-tuning for Semantic Search in a real-world Enterprise Team Formation problem scenario. In this problem, we aim to retrieve the best employee for a given task, given their information regarding abilities, experiences, and other aspects. We evaluate two synthetic data generation strategies: (1) augmenting real-world data with synthetic labels and (2) generating synthetic profiles for employees tailored to specific tasks. To measure the impact of these strategies, we fine-tune a pretrained text embedding model using LoRA and Rank Aggregation techniques. We evaluate the model performance against current state-of-the-art algorithms on a human-curated dataset. Our experiments indicate that training a model that uses a combination of both Synthetic data generation strategies outperforms already established pre-trained models on the Team Formation task, improving the tested ranking metrics by an average of 30\% in comparison to the best-performing pre-trained model.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: financial/business NLP, NLP in resource-constrained settings, dense retrieval
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: Portuguese
Submission Number: 4964
Loading