The Causal Reasoning Ability of Open Large Language Model: A Comprehensive and Exemplary Functional Testing
Abstract: As intelligent software, large language models have recently seen rapid development and broad application, bringing tremendous changes to general AI and the software industry. Nonetheless, large language models, especially open-source ones, unavoidably suffer from potential software quality issues such as instability, inaccuracy, and insecurity, making software testing necessary. In this paper, we propose the first solution for the functional testing of open large language models, checking full-scene availability and deriving empirical principles for better steering these models, particularly in view of their black-box and intelligent nature. Specifically, we focus on the model's causal reasoning ability, which is central to artificial intelligence yet largely ignored by previous work. First, for comprehensive evaluation, we deconstruct the causal reasoning capability into five dimensions and summarize the forms of the causal reasoning task as causality identification and causality matching. Then, rich datasets are introduced and further modified to generate test cases across the different ability dimensions and task forms, improving testing integrity. Moreover, we explore the ability boundary of open large language models in two usage modes: prompting and lightweight fine-tuning. Our work conducts comprehensive functional testing of the causal reasoning ability of open large language models, establishes benchmarks, and derives empirical insights for practical usage. The proposed testing solution can be transferred to other similar evaluation tasks as a general framework for large language models or their derivatives.
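To make the two task forms concrete, the minimal sketch below (illustrative only; the placeholder model name, prompt templates, example cases, and helper functions are assumptions, not the paper's implementation) shows how causality identification and causality matching test cases could be posed to an open LLM in prompting mode via the Hugging Face transformers pipeline API.

```python
# Illustrative sketch (not the paper's implementation): posing the two causal
# reasoning task forms -- causality identification and causality matching --
# to an open LLM in prompting mode using the Hugging Face `transformers` API.
from transformers import pipeline

MODEL_NAME = "gpt2"  # placeholder; substitute any open LLM checkpoint

generator = pipeline("text-generation", model=MODEL_NAME)

# Causality identification: decide whether a causal relation holds between two events.
identification_case = {
    "cause": "heavy rainfall",
    "effect": "the river flooding",
    "expected": "yes",
}

# Causality matching: pick the statement that correctly matches the given cause.
matching_case = {
    "premise": "The power went out.",
    "options": ["The lights turned off.", "The sun rose."],
    "expected": 0,
}

def ask_identification(case):
    prompt = f"Does '{case['cause']}' cause '{case['effect']}'? Answer yes or no.\nAnswer:"
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    return answer.startswith(case["expected"])

def ask_matching(case):
    options = "\n".join(f"{i}. {o}" for i, o in enumerate(case["options"]))
    prompt = (
        f"Premise: {case['premise']}\nWhich option is the effect of the premise?\n"
        f"{options}\nAnswer with the option number:"
    )
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    answer = out[len(prompt):].strip()
    return answer.startswith(str(case["expected"]))

if __name__ == "__main__":
    print("identification passed:", ask_identification(identification_case))
    print("matching passed:", ask_matching(matching_case))
```

A full test suite would iterate such cases over the five ability dimensions and aggregate pass rates into a benchmark score; the structure above only indicates how the two task forms differ in prompt construction and answer checking.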
External IDs: dblp:conf/qrs/LiZLLH23