Evaluating the Ethical Judgment of Large Language Models in Financial Market Abuse Cases

Avinash Kumar Pandey, Swati Rajwal

Published: 15 Nov 2025, Last Modified: 26 Nov 2025. License: CC BY-SA 4.0
Abstract: This paper presents a novel benchmark that evaluates how large language models (LLMs) respond to trading scenarios involving potential market abuse. Starting from 73 manually anonymized real enforcement cases, we generate 1,971 scenario variants using a Taguchi L27 factorial design that varies key contextual factors such as reward level, regulatory tone, and risk appetite. Ten state-of-the-art LLMs (both open-source and proprietary) are tested on six tasks that simulate decisions made by traders that are of interest to compliance officers. Results show that larger models are more cautious overall but remain sensitive to incentives: higher expected profits, obfuscated language, and compliant framing increase the likelihood of approving suspicious trades. Misclassifications and ethically questionable approvals occur across all models, highlighting critical gaps in LLM judgment. Our benchmark offers a structured foundation for improving model safety and reliability in high-stakes financial applications.
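
To make the variant-generation step concrete, the sketch below shows one way the scenario expansion could work under simple assumptions: three illustrative factors (reward level, regulatory tone, risk appetite), each at three levels, crossed with every anonymized case. With exactly three three-level factors, the full factorial has 3^3 = 27 combinations per case, matching the run count of a Taguchi L27 array, and 73 cases times 27 combinations gives 1,971 variants. The factor names, level labels, and helper functions here are hypothetical and not taken from the paper.

```python
from itertools import product

# Hypothetical three-level factors; the paper names reward level, regulatory
# tone, and risk appetite only as examples, so the real factor set and level
# labels may differ.
FACTORS = {
    "reward_level":    ["low", "medium", "high"],
    "regulatory_tone": ["lenient", "neutral", "strict"],
    "risk_appetite":   ["averse", "neutral", "seeking"],
}


def generate_variants(cases):
    """Cross each anonymized case with every combination of factor levels.

    With three factors at three levels each, the full factorial has
    3**3 = 27 combinations per case (the same size as a Taguchi L27 array);
    73 cases * 27 combinations = 1,971 scenario variants.
    """
    variants = []
    for case_id, case_text in enumerate(cases):
        for levels in product(*FACTORS.values()):
            context = dict(zip(FACTORS.keys(), levels))
            variants.append({
                "case_id": case_id,
                "case_text": case_text,
                **context,
            })
    return variants


if __name__ == "__main__":
    # Placeholder case texts standing in for the anonymized enforcement cases.
    dummy_cases = [f"anonymized enforcement case #{i}" for i in range(73)]
    variants = generate_variants(dummy_cases)
    print(len(variants))  # 1971
```

If the actual design uses more than three factors, an L27 orthogonal array would select a balanced subset of 27 combinations rather than the full factorial shown here; the crossing with the 73 base cases works the same way.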