Track: tiny / short paper (up to 2 pages)
Keywords: ML Architecture, SME, Domain Expertise
Abstract: Recent advances in large language models have driven adoption across specialized domains, but their effectiveness on tasks with limited training data remains unclear. We investigate this question through bias detection in medical curriculum text, comparing models ranging from DistilBERT (67M parameters) to Llama-3.2 (1.2B parameters) under both sequence-classification and causal language-modeling formulations. Our findings challenge conventional assumptions about model scaling: while the instruction-tuned Llama achieved the strongest screening performance (AUC: 0.7904, F2: 0.5760), architectural choices proved more critical than model size. DistilBERT remained competitive, achieving the second-highest AUC (0.8857) despite its much smaller size. These results suggest that for specialized classification tasks with limited training data, architectural alignment and instruction tuning may matter more than raw model capacity. Our work offers practical guidance for deploying language models in domain-specific applications where expert annotation is expensive and dataset size is necessarily limited.
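The abstract reports F2 alongside AUC as a screening metric. As a minimal illustrative sketch (not the paper's code; the labels below are hypothetical), F-beta with beta=2 weights recall over precision, which suits screening settings where missing a positive case is costlier than a false alarm:

```python
def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score for binary labels; beta=2 emphasizes recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy example with made-up labels (not the paper's data):
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
print(round(f_beta(y_true, y_pred, beta=2.0), 4))  # → 0.75
```

Here precision and recall are both 0.75, so F2 equals 0.75; when they differ, F2 moves toward recall, which is why it is reported for the screening comparison above.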
Anonymization: This submission has been anonymized for double-blind review by removing names, affiliations, and identifying URLs.
Submission Number: 43