Abstract: This paper presents a novel benchmark that evaluates how large language models (LLMs) respond to trading scenarios involving potential market abuse. Starting from 73 manually anonymized real enforcement cases, we generate 1,971 scenario variants using a Taguchi L27 factorial design that varies key contextual factors such as reward level, regulatory tone, and risk appetite. Ten state-of-the-art LLMs (both open-source and proprietary) are tested on six tasks that simulate decisions made by traders and are of interest to compliance officers. Results show that larger models are more cautious overall but remain sensitive to incentives: higher expected profits, obfuscated language, and compliant framing increase the likelihood of approving suspicious trades. Misclassifications and ethically questionable approvals occur across all models, highlighting critical gaps in LLM judgment. Our benchmark offers a structured foundation for improving model safety and reliability in high-stakes financial applications.
DOI: 10.1145/3768292.3770439
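The scenario-generation arithmetic in the abstract (73 base cases × 27 design runs = 1,971 variants) can be illustrated with a minimal sketch. The factor names come from the abstract, but the levels, the restriction to exactly three factors, and all code below are assumptions for illustration: with three three-level factors the full 3^3 factorial happens to coincide with the 27 runs of a Taguchi L27 array, whereas the paper's actual design may cover a larger factor set.

```python
from itertools import product

# Hypothetical three-level factors (illustrative names and levels,
# not the paper's exact factor set).
FACTORS = {
    "reward_level": ["low", "medium", "high"],
    "regulatory_tone": ["strict", "neutral", "permissive"],
    "risk_appetite": ["averse", "moderate", "aggressive"],
}


def generate_variants(base_cases):
    """Cross each anonymized base case with every factor-level combination.

    With three factors at three levels, the full 3^3 factorial has 27 runs,
    the same size as a Taguchi L27 array; 73 cases x 27 runs = 1,971 variants.
    """
    variants = []
    for case_id, case_text in enumerate(base_cases):
        for levels in product(*FACTORS.values()):
            variants.append({
                "case_id": case_id,
                "scenario": case_text,
                **dict(zip(FACTORS.keys(), levels)),
            })
    return variants


# Example: 73 placeholder cases yield 73 * 27 = 1,971 scenario variants.
variants = generate_variants([f"case_{i}" for i in range(73)])
assert len(variants) == 1971
```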