IPM-Bench: A Multi-task Benchmark for Evaluating Large Language Models in Integrated Pest Management
Keywords: Integrated Pest Management, domain-specific benchmark, multi-task evaluation, agricultural decision support, large language models, evidence grounding, structured output compliance
Abstract: Integrated Pest Management (IPM) is essential for sustainable agriculture and global food security, yet its complexity presents significant challenges for AI support (Deguine et al., 2021; Zhou et al., 2024). To address the lack of evaluation tools for large language models (LLMs) in this domain, we present IPM-Bench, a comprehensive, expert-curated multi-task benchmark tailored to real-world IPM scenarios. It comprises 2,600 high-quality examples sourced from thousands of expert-authored documents published by leading U.S. university extension programs and global plant health knowledge resources (Sivapragasam and Chan, 2017). Covering 13 diverse task types, including classification, question answering, summarization, and named entity recognition, the benchmark mirrors the full spectrum of IPM workflows, from pest diagnosis to management decision-making. We evaluate a broad range of frontier, open-source, and agriculture-oriented LLMs to establish performance baselines. While many models demonstrate strong general capabilities, they often struggle with complex reasoning and strict format adherence. IPM-Bench thus provides a critical, workflow-oriented, and evidence-grounded resource for assessing and advancing AI systems for agricultural decision-making. We will release the benchmark and evaluation code to support reproducible research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, resources and evaluation, domain-specific evaluation, multi-task benchmarks, large language models, workflow-oriented evaluation, agricultural NLP
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 6837