Keywords: Annotation Guidelines, Annotation Moderation, Large Language Models, Biomedical Named Entity Recognition
Abstract: While Large Language Models (LLMs) demonstrate remarkable zero-shot annotation capabilities, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning-optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows strong potential for effectively refining guidelines, our analysis also reveals significant room for improvement.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic creation and evaluation of language resources
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 5933