Evaluating AI-driven Psychotherapy: Insights from Large Language Models and Human Expert Comparisons

ACL ARR 2024 June Submission660 Authors

12 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: The integration of Large Language Models (LLMs), such as GPT-4, shows great promise in mental health applications for initial assessments based on user-reported symptoms. Traditional assessments often rely on subjective evaluations by professional psychologists, leading to inconsistent reproducibility across datasets. To address this, we developed a comprehensive evaluation framework using entropy analysis, keyword frequency analysis, and Latent Dirichlet Allocation (LDA) to quantitatively assess LLM outputs. Our results indicate that LLMs can effectively identify and engage with a range of treatment topics and offer a broader range of treatment opinions than human psychologists. However, LLM responses lack depth, and the recommendations they generate tend to use generalized rather than professional vocabulary. This study explores the feasibility of LLMs as virtual psychotherapists, highlights their shortcomings in depth, and proposes improved methods for evaluating large-model responses. It provides valuable insights into the potential and challenges of integrating LLMs into mental health practice, paving the way for future research to enhance the effectiveness and reliability of AI-driven therapeutic solutions.
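The abstract names three quantitative measures but the paper's actual pipeline is not shown here, so the following is only a minimal illustrative sketch of how such a framework might be assembled with standard tooling (scikit-learn, SciPy). The toy responses, the stop-word choice, the number of topics, and all variable names are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: entropy, keyword frequency, and LDA over a
# set of therapy-style responses. Parameters (n_components, stop_words)
# are assumed values, not taken from the paper.
from collections import Counter

import numpy as np
from scipy.stats import entropy
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for model- or psychologist-generated recommendations.
responses = [
    "Consider cognitive behavioral therapy and regular sleep habits.",
    "Mindfulness exercises and journaling may reduce anxiety symptoms.",
    "A structured exposure plan can help with avoidance behaviors.",
]

# Keyword frequency: count non-stop-word tokens across all responses.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(responses)          # (n_docs, n_terms)
term_totals = np.asarray(counts.sum(axis=0)).ravel()
keyword_freq = Counter(dict(zip(vectorizer.get_feature_names_out(),
                                term_totals)))
print("Top keywords:", keyword_freq.most_common(5))

# Shannon entropy of the corpus-level word distribution; higher entropy
# indicates a broader, less repetitive vocabulary.
word_probs = term_totals / term_totals.sum()
print("Vocabulary entropy (nats):", entropy(word_probs))

# LDA: recover latent treatment topics and print top terms per topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)                # (n_docs, n_topics)
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = terms[np.argsort(weights)[::-1][:4]]
    print(f"Topic {k}: {', '.join(top)}")
```

Under this reading, comparing vocabulary entropy and topic coverage between LLM and human-expert responses would operationalize the abstract's claims about breadth (higher entropy, more topics engaged) versus depth (generalized rather than professional vocabulary).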
Paper Type: Long
Research Area: Human-Centered NLP
Research Area Keywords: human-AI interaction, human-centered evaluation, user-centered design
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 660