Keywords: Large Language Models, Evaluation Benchmark, Monetization Systems
Abstract: Large language models (LLMs) are increasingly used in monetization systems for user understanding, advertisement selection, and recommendation evaluation. However, existing benchmarks primarily focus on general knowledge and open-domain reasoning, offering limited coverage of the domain-specific challenges present in real-world monetization settings. We introduce MonBench, a high-quality multi-task benchmark designed to evaluate language models in realistic monetization scenarios. MonBench is formulated as multiple-choice questions spanning user intent understanding and relationship reasoning, and is constructed from billions of production-scale monetization data points with expert annotations and controlled difficulty to ensure reliability and discriminative power. We evaluate approximately 20 state-of-the-art LLMs on MonBench and uncover substantial performance gaps relative to general-purpose benchmarks, exposing persistent limitations in monetization-specific reasoning and intent understanding. We further analyze sensitivity to evaluation and prompting choices, including answer order, output formats, instruction fine-tuning, reasoning, and in-context learning, and conduct per-task error analyses to identify reasoning-intensive monetization challenges that remain difficult even for the strongest models. Together, these findings establish MonBench as a rigorous benchmark and offer practical guidance for developing and evaluating LLMs in real-world monetization systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, NLP datasets, automatic evaluation of datasets, evaluation, metrics, reproducibility
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7116