EcomEval: Towards Reliable Evaluation of Large Language Models for Multilingual and Multimodal E-Commerce Applications

Shuyi Xie; Zi Qin Liew; Hailing Zhang; Haibo Zhang; Ling Hu; ZHOU ZHIQIANG; shuman liu; Anxiang Zeng

17 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | CC BY 4.0

Keywords: LLM, E-Commerce, Evaluation, Benchmark, Multilingual, Multimodal

Abstract: Large Language Models (LLMs) excel on general-purpose NLP benchmarks, yet their capabilities in specialized domains remain underexplored. In e-commerce, existing evaluations such as EcomInstruct, ChineseEcomQA, eCeLLM, and Shopping MMLU suffer from limited task diversity (e.g., no coverage of product guidance or after-sales issues), limited modalities (e.g., no multimodal data), reliance on synthetic or curated data, and a narrow focus on English and Chinese. This leaves practitioners without reliable tools to assess models on complex, real-world shopping scenarios. We introduce EcomEval, a comprehensive multilingual and multimodal benchmark for evaluating LLMs in e-commerce. EcomEval covers six categories and 37 tasks (including 8 multimodal tasks), sourced primarily from authentic customer queries and transaction logs, reflecting the noisy and heterogeneous nature of real business interactions. To ensure both quality and scalability of reference answers, we adopt a semi-automatic pipeline in which large models draft candidate responses that are then reviewed and revised by over 50 expert annotators with strong e-commerce and multilingual expertise. We define a difficulty level for each question and task category by averaging evaluation scores across models of different sizes and capabilities, enabling challenge-oriented and fine-grained assessment. EcomEval also spans eight languages, including four low-resource Southeast Asian languages, offering a multilingual perspective absent from prior work. We evaluate 19 open and proprietary LLMs on EcomEval, revealing substantial performance disparities and highlighting scenarios where these general-purpose models perform poorly in the e-commerce domain. By combining diversity, authenticity, quality, difficulty awareness, multilinguality, and multimodality, EcomEval establishes a rigorous and representative testbed for advancing research and deployment of LLMs in e-commerce.
Upon acceptance, we will release the full dataset to support reproducible research.
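
The abstract says difficulty levels come from averaging evaluation scores across models, but gives no scale or thresholds. The sketch below is one possible reading, not the authors' procedure: it assumes scores normalized to [0, 1], and the function names, the three labels, and the cutoffs 0.7 and 0.4 are all hypothetical choices made for illustration.

```python
from statistics import mean


def question_difficulty(scores_by_model, easy_cutoff=0.7, hard_cutoff=0.4):
    """Label one question from the mean of its per-model scores.

    Assumptions (not from the paper): scores lie in [0, 1], and the
    cutoffs and the easy/medium/hard labels are illustrative only.
    """
    avg = mean(scores_by_model.values())
    if avg >= easy_cutoff:
        return "easy"
    if avg <= hard_cutoff:
        return "hard"
    return "medium"


def category_mean_score(questions):
    """Average the per-question mean scores within one task category.

    `questions` is a list of {model_name: score} dicts, one per question.
    """
    return mean(mean(q.values()) for q in questions)


# Hypothetical scores for three models on three questions.
q_easy = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7}
q_mid = {"model_a": 0.6, "model_b": 0.5, "model_c": 0.55}
q_hard = {"model_a": 0.3, "model_b": 0.2, "model_c": 0.1}

labels = [question_difficulty(q) for q in (q_easy, q_mid, q_hard)]
category_score = category_mean_score([q_easy, q_mid, q_hard])
```

Averaging over models of varied size keeps a single strong or weak model from determining the label; a lower mean score indicates a harder question or category under this reading.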

Primary Area: datasets and benchmarks

Submission Number: 9218
