LOB-Bench: Benchmarking Generative AI for Finance - an Application to Limit Order Book Data

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: LOB-Bench offers a rigorous framework and open-source Python package for standardized evaluation of generative limit order book data models, addressing evaluation gaps and enhancing model comparisons with quantitative metrics.
Abstract: While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present **LOB-Bench**, a benchmark, implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains "market impact metrics", i.e. the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, as well as a parametric LOB model, and find that the autoregressive GenAI approach beats traditional model classes.
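As an illustration of the distributional evaluation described above, the sketch below compares the spread distribution of real versus generated data using a Wasserstein-1 distance and an L1 histogram distance. This is a minimal, self-contained example on synthetic placeholder arrays (`real_spreads`, `gen_spreads` are stand-ins), not the package's actual API.

```python
# Minimal sketch: scoring a generative model on one unconditional LOB statistic
# (the bid-ask spread) by comparing its distribution to the real-data distribution.
# `real_spreads` and `gen_spreads` are placeholder arrays, not LOB-Bench outputs.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Placeholder data: best-ask minus best-bid, in ticks, one value per LOB snapshot.
real_spreads = rng.exponential(scale=2.0, size=10_000)  # stand-in for real data
gen_spreads = rng.exponential(scale=2.3, size=10_000)   # stand-in for model output

# Wasserstein-1 distance between the two empirical distributions.
w1 = wasserstein_distance(real_spreads, gen_spreads)

# Alternative: L1 distance between normalized histograms on a shared binning,
# one common way to score histogram-level discrepancies.
bins = np.histogram_bin_edges(np.concatenate([real_spreads, gen_spreads]), bins=50)
p, _ = np.histogram(real_spreads, bins=bins, density=True)
q, _ = np.histogram(gen_spreads, bins=bins, density=True)
l1 = 0.5 * np.sum(np.abs(p - q)) * np.diff(bins)[0]

print(f"Wasserstein-1: {w1:.4f}, L1 histogram distance: {l1:.4f}")
```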
Lay Summary: High-frequency trading data is an interesting domain for models that aim to generate sequential events, i.e. where each generated piece of data depends on all the previously generated data. A number of models attempt to generate this kind of high-frequency financial data using different approaches, but it is very difficult to compare them. This paper provides a series of evaluations that allow data generated by different models to be compared. The benchmark consists of three main parts. The first considers different features that can be measured from a sequence of generated events, for example the number of orders to buy or sell a stock in a time period. Measuring such a feature across a large number of sequences allows for the construction of a distribution, in practice a histogram. This distribution can be constructed for both real and generated data, and metrics can be applied to measure how similar or different these distributions are. We also consider the case where a neural network itself learns the distinguishing features of the data. Secondly, we consider the price impact of different order types. A sign that a generative model reproduces sequences well is that the arrival of a given order type moves the price in the expected way, on average, over some time period. Finally, we use generated data to see how it affects learning on a trend forecasting task. We compare cutting-edge models for sequence generation with more traditional models and find that the newer models outperform the traditional ones. Having access to this sort of benchmark is very important, as it allows researchers to compare how good their models are in this application in a standardised way.
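To make the price-impact idea concrete, here is a minimal sketch of an average price-response curve: the expected signed mid-price move in the event-steps following a chosen order type. The function and the arrays `mid_price`, `event_sign`, and `is_market_order` are illustrative placeholders, not the benchmark's implementation; in practice one would compute this curve on both real and generated sequences and compare the two.

```python
# Minimal sketch of a price-response function:
# R(lag) = E[ sign_t * (m_{t+lag} - m_t) ] averaged over events of a given type.
import numpy as np

def price_response(mid_price, event_sign, event_mask, max_lag=50):
    """Average signed mid-price change at each lag after the selected events."""
    idx = np.flatnonzero(event_mask)
    idx = idx[idx + max_lag < len(mid_price)]  # drop events too close to the end
    lags = np.arange(1, max_lag + 1)
    # For each selected event, the mid-price change at every lag (shape: events x lags).
    diffs = mid_price[idx[:, None] + lags[None, :]] - mid_price[idx, None]
    return lags, (event_sign[idx, None] * diffs).mean(axis=0)

# Placeholder usage on synthetic data.
rng = np.random.default_rng(1)
m = np.cumsum(rng.normal(size=100_000))        # stand-in mid-price path
signs = rng.choice([-1.0, 1.0], size=100_000)  # +1 buy-side, -1 sell-side
is_market_order = rng.random(100_000) < 0.05   # stand-in event-type mask
lags, response = price_response(m, signs, is_market_order)
```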
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/peernagy/lob_bench
Primary Area: Applications->Time Series
Keywords: finance, generative models, time series, state-space models, benchmark
Submission Number: 15644