mSOP-765k: A Benchmark For Multi-Modal Structured Output Predictions

TMLR Paper5863 Authors

10 Sept 2025 (modified: 28 Oct 2025), Under review for TMLR, CC BY 4.0
Abstract: This paper introduces mSOP-765k, a large-scale benchmark for the evaluation of multi-modal Structured Output Prediction (mSOP) pipelines. In addition to novel evaluation metrics, the benchmark provides combined training and test datasets with over 765,000 images taken from real-world product advertisements. Each image contains product visualizations, textual information such as product name or brand, and numerical data such as product weight, price, and discount. All images are annotated with the corresponding structured information in the form of dictionaries containing key-value pairs. An initial baseline evaluation, including various LLMs and VLMs as well as multi-modal RAG approaches, shows that the benchmark poses a challenging problem that state-of-the-art mSOP methods cannot yet solve completely. The benchmark and dataset are available under a Creative Commons license: https://huggingface.co/datasets/retail-product-promotion/mSOP-765k
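To make the key-value annotation format concrete, the following is a minimal sketch of what one annotation dictionary might look like and how a prediction could be compared against it. The field names, the values, and the `field_accuracy` helper are all hypothetical illustrations: they are not taken from the dataset schema, and the exact-match score shown here is a simple stand-in, not one of the evaluation metrics proposed in the paper.

```python
from typing import Any, Dict

# Hypothetical annotation for one advertisement image.
# Keys and values are illustrative assumptions, not the dataset's schema.
reference: Dict[str, Any] = {
    "product_name": "Orange Juice",
    "brand": "ExampleBrand",
    "weight": "1 l",
    "price": 1.99,
    "discount": 0.20,
}


def field_accuracy(predicted: Dict[str, Any], expected: Dict[str, Any]) -> float:
    """Return the fraction of expected keys whose predicted value matches exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


# A model output that gets one of the five fields wrong.
prediction = dict(reference, price=2.49)
score = field_accuracy(prediction, reference)
print(f"field-level exact-match accuracy: {score:.2f}")
```

Exact matching is deliberately strict: a price that is off by one cent and a missing price both count as a miss, which is one reason a benchmark of this kind may define more nuanced metrics.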
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Mohammad_Ghavamzadeh1
Submission Number: 5863