Evaluating open- and closed-source LLM-based chatbots to combat cybercrime targeting senior citizens
Abstract: This study evaluates state-of-the-art open-source and closed-source LLMs, fine-tuned on a cybersecurity Q&A task, to assist senior citizens in recognizing and responding to cybercrime. We evaluate five LLMs and their fine-tuned variants using automatic metrics (F1, BERTScore, and n-gram-based overlap) and a human evaluation of response quality across seven criteria, including accuracy, relevance, and usefulness. Our results show that several open-source models, particularly the fine-tuned variants, outperform the closed-source model: Mistral3-LoRA leads on nearly all automatic metrics, and LLaMA3.1-LoRA achieves the highest recall. However, ChatGPT-4o performs slightly better in the human evaluation, with annotators preferring its responses for their formatting and polished language. Our chatbot application, code, and data are available at https://anonymous.4open.science/r/SeniorSafeAI-36F4/.
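The abstract names three automatic metrics: token-level F1, BERTScore, and n-gram-based overlap. The snippet below is a minimal illustrative sketch of how such metrics are commonly computed; it is not the authors' evaluation code, and it assumes the third-party Python packages `bert-score` and `sacrebleu` plus hypothetical example strings.

```python
# Illustrative sketch of the automatic metrics named in the abstract:
# token-level F1, BERTScore, and n-gram overlap (BLEU via sacreBLEU).
# Assumes `pip install bert-score sacrebleu`; example strings are invented.
from collections import Counter

import sacrebleu
from bert_score import score as bertscore


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


predictions = ["Do not click the link; report the email as phishing."]
references = ["Avoid clicking the link and report the message as a phishing attempt."]

# Lexical overlap between prediction and reference.
f1 = token_f1(predictions[0], references[0])

# Semantic similarity from contextual embeddings.
_, _, bert_f1 = bertscore(predictions, references, lang="en")

# Corpus-level n-gram overlap.
bleu = sacrebleu.corpus_bleu(predictions, [references])

print(f"token F1: {f1:.3f}  BERTScore F1: {bert_f1.mean().item():.3f}  BLEU: {bleu.score:.1f}")
```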
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Interdisciplinary recontextualization of NLP; Cybersecurity; Evaluation and metrics; Human evaluation; Automatic evaluation; Large language models; Senior citizens
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 3605