Abstract: In this article, we present the results of our interdisciplinary work on identifying pesticide names in research articles written primarily in Brazilian Portuguese, but also in Spanish, French, and Italian. We proceed cross-lingually, extracting information from a large, high-quality corpus in English and then applying it to the lower-resource languages. We show that a combination of state-of-the-art multilingual transformer models, sentence-based similarity metrics, and expert knowledge yields the best results in our low-resource task. This combination yields twice as many true positives as the November 2023 version of GPT-4, and it decisively outperforms other baselines, including a classical NER model fine-tuned on our training data. Our approach offers a promising start and might be transferable to other similarly demanding tasks in low-resource contexts.
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings
Languages Studied: English, Brazilian Portuguese, Spanish, French, Italian
