Keywords: Large Language Models, Evaluation Benchmark, Monetization Systems
Abstract: Large language models (LLMs) are increasingly used in monetization systems for user understanding, advertisement selection, and recommendation evaluation. However, existing benchmarks primarily focus on general knowledge and open-domain reasoning, offering limited coverage of the domain-specific challenges present in real-world monetization settings. We introduce MonBench, a high-quality multi-task benchmark designed to evaluate language models in realistic monetization scenarios. MonBench is formulated as multiple-choice questions spanning user intent understanding and relationship reasoning, and is constructed from billions of production-scale monetization data points with expert annotations and controlled difficulty to ensure reliability and discriminative power. We evaluate approximately 20 state-of-the-art LLMs on MonBench and uncover substantial performance gaps relative to general-purpose benchmarks, exposing persistent limitations in monetization-specific reasoning and intent understanding. We further analyze sensitivity to evaluation and prompting choices, including answer order, output formats, instruction fine-tuning, reasoning, and in-context learning, and conduct per-task error analyses to identify reasoning-intensive monetization challenges that remain difficult even for the strongest models. Together, these findings establish MonBench as a rigorous benchmark and offer practical guidance for developing and evaluating LLMs in real-world monetization systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, NLP datasets, automatic evaluation of datasets, evaluation, metrics, reproducibility
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7116