Keywords: token representations, word representations, Named Entity Recognition, NER, Entity Recognition, Foundation Model, language-independent, domain-agnostic
Abstract: Entity Recognition has long been one of the most important problems in Natural Language Processing. However, little research has aimed at creating a high-quality multilingual, domain-agnostic Foundation Model for the Entity Recognition task. We introduce novel LLM-powered data-creation and contrastive-learning-based pre-training procedures that enable us to build a new state-of-the-art Foundation Model for Entity Recognition. The model is designed and trained to perform well on data from diverse domains and to produce language-independent features, thanks to a new, diverse multilingual training dataset. When used as a Foundation Model, it surpasses all existing models on both English and multilingual Entity Recognition tasks, improving the macro F1-score of multilingual BERT by 10 points in the single-language scenario and by 13.5 points in the multi-language scenario across French, German, English, Spanish, Italian, Polish, Portuguese, and Russian. We open-source our model on the HuggingFace platform.
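The abstract does not spell out the pre-training objective, so the following is only a minimal sketch of what a contrastive-learning-based pre-training loss over token representations might look like, assuming an InfoNCE-style setup with in-batch negatives. The function name, the two-view pairing scheme, and the temperature value are our assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(anchor: torch.Tensor,
                           positive: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss over token representations.

    anchor, positive: (num_tokens, dim) embeddings of the same tokens
    under two views (e.g. two augmentations or two languages). Each
    anchor's positive is the matching row; all other rows in the batch
    serve as negatives.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    # Pairwise cosine similarities scaled by temperature: (num_tokens, num_tokens)
    logits = a @ p.T / temperature
    # The i-th anchor should match the i-th positive (the diagonal).
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Toy usage: 8 token embeddings of dimension 256 from two encoder passes.
anchor = torch.randn(8, 256)
positive = anchor + 0.1 * torch.randn(8, 256)  # stand-in for a second view
loss = token_contrastive_loss(anchor, positive)
```

Using the other tokens in the batch as negatives is a common design choice in contrastive pre-training because it avoids explicit negative mining; whether the paper does this is not stated in the abstract.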
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9203