LOB-Bench: Benchmarking Generative AI for Finance - with an Application to Limit Order Book Markets

Peer Nagy; Sascha Yves Frey; Kang Li; Svitlana Vyetrenko; Stefan Zohren; Ani Calinescu; Jakob Nicolaus Foerster

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · Readers: Everyone · CC BY 4.0

Keywords: finance, generative models, time series, state-space models, benchmark

TL;DR: LOB-Bench offers a rigorous framework and open-source Python package for standardized evaluation of generative limit order book data models, addressing evaluation gaps and enhancing model comparisons with quantitative metrics.

Abstract: We present **LOB-Bench**, a benchmark designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB). We enable a rigorous and comprehensive model comparison by providing both a theoretical framework and an open-source Python package. Addressing the lack of consensus on evaluation paradigms in the literature, where qualitative comparison of stylized facts is prevalent, our work offers a crucial building block for advancing generative AI for financial data. LOB-Bench provides a standardized method to numerically assess the quality of various model classes that generate limit order book data in the widely used LOBSTER format. It provides a range of quantitative characteristics and includes a simple parametric benchmark model as a baseline. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting a flexible multivariate statistical evaluation across different model classes. The benchmark features commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with adversarial scores derived from a neural network trained to differentiate between real and generated data. Additionally, LOB-Bench evaluates "market impact metrics" by computing cross-correlations and price response functions for specific events in the data. We present empirical benchmark results for a generative autoregressive state-space model, a (C)GAN, and a parametric LOB model. We find that the autoregressive GenAI approach outperforms traditional model classes.
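To make the idea of "distributional differences in unconditional statistics" concrete, the following is a minimal sketch and not the LOB-Bench package's API. All function names (`spread`, `order_imbalance`, `histogram_distance`) are hypothetical, and the choice of distance (half the L1 distance between normalised histograms over shared bins, which lies in [0, 1]) is an assumption made for illustration; the abstract does not specify which distance measures the benchmark uses.

```python
import numpy as np


def spread(best_ask, best_bid):
    """Bid-ask spread per book snapshot (hypothetical helper)."""
    return np.asarray(best_ask, dtype=float) - np.asarray(best_bid, dtype=float)


def order_imbalance(bid_vol, ask_vol):
    """Volume imbalance in [-1, 1]; assumes bid_vol + ask_vol > 0."""
    bid_vol = np.asarray(bid_vol, dtype=float)
    ask_vol = np.asarray(ask_vol, dtype=float)
    return (bid_vol - ask_vol) / (bid_vol + ask_vol)


def histogram_distance(real, generated, n_bins=20):
    """Half the L1 distance between normalised histograms on shared bins.

    Returns 0.0 for identical samples and 1.0 when the two samples fall
    into disjoint bins. This is an illustrative choice of distance only.
    """
    real = np.asarray(real, dtype=float)
    generated = np.asarray(generated, dtype=float)
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    if lo == hi:  # both samples are the same constant
        return 0.0
    edges = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(real, bins=edges)[0] / real.size
    q = np.histogram(generated, bins=edges)[0] / generated.size
    return 0.5 * float(np.abs(p - q).sum())


if __name__ == "__main__":
    # Synthetic stand-ins for real and generated spreads (not actual LOB data).
    rng = np.random.default_rng(0)
    real_spread = rng.exponential(scale=1.0, size=5000)
    fake_spread = rng.exponential(scale=1.3, size=5000)
    print(histogram_distance(real_spread, fake_spread))
```

The same distance can be applied to any scalar statistic (order book volume, imbalance, inter-arrival time), which is what allows one score per statistic to be compared across model classes.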

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 10620
