INS-ActBench: A Comprehensive Benchmark for Assessing Professional Actuarial Capability of Large Language Models

ACL ARR 2026 January Submission9183 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Actuarial Science, Large Language Models, Benchmarking, Insurance AI, Professional Expertise.
Abstract: While Large Language Models (LLMs) have shown strong performance in general financial tasks, their capabilities in \textbf{actuarial science}—the quantitative foundation of the insurance industry—remain insufficiently evaluated. Existing benchmarks are largely limited to knowledge-oriented question answering or capital market–focused tasks, and fail to assess practical actuarial modeling and execution skills. To bridge this gap, we introduce \textbf{INS-ActBench}, a comprehensive benchmark engineered to shift the evaluation paradigm from ``declarative knowledge'' to ``professional capability.'' Grounded in the intersection of cross-jurisdictional actuarial competency frameworks and Bloom's Taxonomy, we construct a rigorous four-tier benchmark comprising 6,514 authentic tasks. By integrating six innovative task types, our pipeline validates models against real-world professional standards. Extensive evaluation reveals a distinct ``\textit{Strong Theory, Weak Practice}'' phenomenon: while models exhibit proficiency in conceptual calculation, their performance deteriorates significantly on tasks requiring precise tool manipulation and multi-step logical derivation. These findings suggest that current LLMs are best positioned as assistants rather than autonomous actuarial agents, and they provide a critical quantitative baseline for the responsible deployment of LLMs in high-stakes financial risk management. The code and data are available at \url{https://anonymous.4open.science/r/ActuarialBench-3B5D}.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, NLP Applications, Question Answering, Language Modeling
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 9183