Keywords: NLP; SNOMED CT; concept normalization; entity linking; Brazilian Portuguese; low-resource languages; hybrid systems; deterministic methods; embeddings; large language models; clinical coding; healthcare interoperability; LLM; ML; AI
TL;DR: Hybrid, deterministic-first cascade for accurate, low-cost SNOMED CT concept ID assignment in Brazilian Portuguese clinical text.
Abstract: Mapping free-text clinical narratives to standardized ontologies is a critical bottleneck in healthcare NLP, yet it remains largely unsolved for low-resource languages. To our knowledge, no automated system exists for SNOMED CT concept ID assignment in Brazilian Portuguese, which limits structured coding across one of the world’s largest public health systems. We introduce SNOKind, a hybrid decision cascade that combines heuristic rules, embedding-based similarity, and LLM reasoning to map clinical entities extracted from Brazilian EHRs to SNOMED CT concepts. SNOKind is designed as a cost-aware pipeline: deterministic rules resolve the majority of cases cheaply and transparently, and stochastic methods are reserved for genuinely ambiguous cases. This ordering strictly dominates LLM-first alternatives in cost and reliability while achieving higher resolution coverage: 92%, compared to 57% for a standalone LLM baseline. Beyond performance, SNOKind demonstrates that orchestration, rather than isolated model expressiveness, is the key design principle for robust clinical NLP in constrained, real-world settings.
Submission Category: Extended Abstract
Overaged Verification: Yes
Latin American Hispanic Heritage: Yes
ICML Proceedings Status: No
Submission Number: 20
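Illustrative sketch: the deterministic-first ordering described in the abstract (rules, then similarity, then an LLM only for what remains) could look roughly like the following. This is not the SNOKind implementation. Every name here is hypothetical: the toy lexicon, the placeholder concept IDs (not real SNOMED CT identifiers), the 0.6 threshold, and the `llm_fallback` callable are all assumptions. Character-trigram Jaccard overlap stands in for the embedding-based similarity stage so that the example runs with the standard library alone.

```python
import unicodedata
from typing import Callable, Optional, Tuple

# Hypothetical toy lexicon: normalized Portuguese term -> placeholder concept ID.
# The IDs are illustrative placeholders, NOT real SNOMED CT identifiers.
LEXICON = {
    "hipertensao arterial": "ID-0001",
    "diabetes mellitus": "ID-0002",
    "infarto agudo do miocardio": "ID-0003",
}


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.split())


def trigrams(text: str) -> set:
    """Character trigrams of the padded string."""
    padded = f"  {text} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets (a stand-in for embedding similarity)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def resolve(
    mention: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> Tuple[Optional[str], str]:
    """Return (concept_id, stage) using the cheapest stage that succeeds."""
    key = normalize(mention)

    # Stage 1: deterministic exact match on the normalized term.
    if key in LEXICON:
        return LEXICON[key], "rule"

    # Stage 2: nearest lexicon entry, accepted only above the threshold.
    best_term = max(LEXICON, key=lambda term: similarity(key, term))
    if similarity(key, best_term) >= threshold:
        return LEXICON[best_term], "similarity"

    # Stage 3: costly fallback, reached only for mentions still unresolved.
    if llm_fallback is not None:
        concept_id = llm_fallback(mention)
        if concept_id is not None:
            return concept_id, "llm"

    return None, "unresolved"
```

With this sketch, `resolve("Hipertensão Arterial")` is settled by the rule stage, a misspelling such as `resolve("diabetes melitus")` is caught by the similarity stage, and only a mention with no close lexicon entry reaches the `llm_fallback` callable. Returning the stage name alongside the ID keeps each decision traceable, which is one way to reflect the transparency the abstract attributes to the deterministic stages.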