Keywords: token representations, word representations, Named Entity Recognition, NER, Entity Recognition, Foundation Model, language-independent, domain-agnostic
Abstract: Entity Recognition has long been one of the most important problems in Natural Language Processing. However, little research has aimed at creating a high-quality multilingual, domain-agnostic Foundation Model for the Entity Recognition task. We introduce novel LLM-powered data-creation and contrastive-learning-based pre-training procedures that enable us to build a new state-of-the-art Foundation Model for Entity Recognition. The model is designed and trained to perform well on data from diverse domains and to produce language-independent features, thanks to a new, diverse multilingual training dataset. When used as a Foundation Model, it surpasses all existing models on both English and multilingual Entity Recognition tasks, improving the macro F1-score of multilingual BERT by 10 points in the single-language scenario and by 13.5 points in the multi-language scenario across French, German, English, Spanish, Italian, Polish, Portuguese, and Russian. We open-source our model on the HuggingFace platform.
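The abstract does not spell out the pre-training objective, so the following is only a minimal sketch of what a contrastive-learning-based pre-training loss over token representations might look like, assuming an InfoNCE-style setup with in-batch negatives. The function name, the two-view pairing scheme, and the temperature value are our assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(anchor: torch.Tensor,
                           positive: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss over token representations.

    anchor, positive: (num_tokens, dim) embeddings of the same tokens
    under two views (e.g. two augmentations or two languages). Each
    anchor's positive is the matching row; all other rows in the batch
    serve as negatives.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    # Pairwise cosine similarities scaled by temperature: (num_tokens, num_tokens)
    logits = a @ p.T / temperature
    # The i-th anchor should match the i-th positive (the diagonal).
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Toy usage: 8 token embeddings of dimension 256 from two encoder passes.
anchor = torch.randn(8, 256)
positive = anchor + 0.1 * torch.randn(8, 256)  # stand-in for a second view
loss = token_contrastive_loss(anchor, positive)
```

Using the other tokens in the batch as negatives is a common design choice in contrastive pre-training because it avoids explicit negative mining; whether the paper does this is not stated in the abstract.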
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9203