Abstract: As Large Language Model (LLM) based chatbots are becoming more accessible, users are relying on these chatbots for reliable and personalized recommendations in diverse domains, ranging from code generation to financial advisement. In this context, we set out to investigate how such systems perform in the personal finance domain, where financial inclusion has been an overarching stated aim of banks for decades. We test widely used LLM-based chatbots, ChatGPT and Bard, and compare their performance against SafeFinance, a rule-based chatbot built using the Rasa platform. The comparison is across two critical tasks: product discovery and multi-product interaction, where product refers to banking products like Credit Cards, Certificate of Deposits, and Checking Accounts. With this study, we provide interesting insights into the chatbots’ efficacy in financial advisement and their ability to provide fair treatment across different user groups. We find that both Bard and ChatGPT can make errors in retrieving basic online information, the responses they generate are inconsistent across different user groups, and they cannot be relied on for reasoning involving banking products. On the other hand, despite their limited generalization capabilities, rule-based chatbots like SafeFinance provide safe and reliable answers to users that can be traced back to their original source. Overall, although the outputs of the LLM-based chatbots are fluent and plausible, there are still critical gaps in providing consistent and reliable financial information.
Loading