Abstract: We compared six methods for classification-based scoring of an assessment center (AC) exercise that tests managerial coaching skills. Accurate scoring of these skills is crucial for selecting the most capable job candidates. Four of the methods were based on fine-tuning, using Gemma-7b, Llama3-8b, Phi-3, and RoBERTa-base; the other two were zero-shot and few-shot prompting with GPT-4o. Phi-3 and Gemma-7b performed best among the fine-tuned LLMs, with Llama3-8b following closely. RoBERTa-base performed robustly: it achieved the best performance for one of the coaching skills but generally scored slightly below the fine-tuned LLMs. Zero-shot and few-shot prompting with GPT-4o performed worst, with zero-shot outperforming few-shot. This pattern of results suggests that the complexity and psychological nature of each skill may interact with model performance.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 2538