Abstract: We compared six methods for classification-based scoring of an assessment center (AC) exercise that tests managerial coaching skills. Accurate scoring of these skills is crucial for selecting the most capable job candidates. Four of the methods were based on fine-tuning, using Gemma-7b, Llama3-8b, Phi-3, and RoBERTa-base; the other two were zero-shot and few-shot prompting with GPT-4o. Phi-3 and Gemma-7b performed best among the fine-tuned LLMs, with Llama3-8b following closely. RoBERTa-base performed robustly: it achieved the best performance for one of the coaching skills but generally scored slightly below the fine-tuned LLMs. Zero-shot and few-shot prompting with GPT-4o performed worst, with zero-shot outperforming few-shot. This pattern of results suggests that the complexity and psychological nature of each skill may interact with model performance.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 2538