Continued pre-training of LLMs for Portuguese and Government domain: A proposal for product identification in textual purchase descriptions

Published: 12 Dec 2023; Last Modified: 26 Feb 2024. PubLLM 2024. License: CC BY 4.0
Track Selection: Track 1: Developing LLM-powered tools for positive outcomes
Keywords: product identification, Large Language Model, Continued pre-training
TL;DR: Continued pre-training of LLMs for Portuguese and Government domain: A proposal for product identification in textual purchase descriptions
Abstract: This study addresses the problem of identifying products in non-standardized invoices, presenting an approach based on large language models (LLMs). Given the scarcity of models trained on the Portuguese language, we continued the pre-training of two LLMs, Llama2-7B and Mistral-Instruct-7B, and then fine-tuned them for the specific task of product identification. Our central hypothesis, "continuing the pre-training of LLMs with Portuguese texts enhances the model's ability to identify products in textual purchase descriptions", was supported by the results, which showed significant improvements over the original models. This research not only contributes to solving a practical problem but also highlights the effectiveness of continuing the pre-training of an LLM in specific linguistic contexts, such as Portuguese.
Submission Number: 3