Continued pre-training of LLMs for Portuguese and Government domain: A proposal for product identification in textual purchase descriptions

Published: 12 Dec 2023; Last Modified: 26 Feb 2024. PubLLM 2024. License: CC BY 4.0
Track Selection: Track 1: Developing LLM-powered tools for positive outcomes
Keywords: product identification, Large Language Model, Continued pre-training
TL;DR: Continued pre-training of LLMs for Portuguese and Government domain: A proposal for product identification in textual purchase descriptions
Abstract: This study addresses the problem of identifying products in non-standardized invoices, presenting an approach based on large language models (LLMs). Given the scarcity of models trained on the Portuguese language, we continued the pre-training of two LLMs, Llama2-7B and Mistral-Instruct-7B, and then fine-tuned them for the specific task of product identification. Our central hypothesis, "continuing the pre-training of LLMs with Portuguese texts enhances the model's ability to identify products in textual purchase descriptions", was supported by the results, which showed significant improvements over the original models. This research not only contributes to solving a practical problem but also highlights the effectiveness of continuing the pre-training of an LLM in specific linguistic contexts, such as Portuguese.
Submission Number: 3