IPM-Bench: A Multi-task Benchmark for Evaluating Large Language Models in Integrated Pest Management
Keywords: Integrated Pest Management, domain-specific benchmark, multi-task evaluation, agricultural decision support, large language models, evidence grounding, structured output compliance
Abstract: Integrated Pest Management (IPM) is essential for sustainable agriculture and global food security, yet its complexity presents significant challenges for AI support (Deguine et al., 2021; Zhou et al., 2024). To address the lack of evaluation tools for large language models (LLMs) in this domain, we present IPM-Bench, a comprehensive, expert-curated multi-task benchmark tailored to real-world IPM scenarios. It comprises 2,600 high-quality examples sourced from thousands of expert-authored documents published by leading U.S. university extension programs and global plant health knowledge resources (Sivapragasam and Chan, 2017). Covering 13 diverse task types, including classification, question answering, summarization, and named entity recognition, the benchmark mirrors the full spectrum of IPM workflows, from pest diagnosis to management decision-making. We evaluate a broad range of frontier, open-source, and agriculture-oriented LLMs to establish performance baselines. While many models demonstrate strong general capabilities, they often struggle with complex reasoning and strict format adherence. IPM-Bench thus provides a critical, workflow-oriented, and evidence-grounded resource for assessing and advancing AI systems for agricultural decision-making. We will release the benchmark and evaluation code to support reproducible research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, resources and evaluation, domain-specific evaluation, multi-task benchmarks, large language models, workflow-oriented evaluation, agricultural NLP
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 6837