Exploring Brazil's LLM Fauna: Investigating the Generative Performance of Large Language Models in Portuguese

Published: 2025 · Last Modified: 06 Feb 2026 · J. Braz. Comput. Soc. 2025 · CC BY-SA 4.0
Abstract: Large Language Models (LLMs) are now embedded in widely used applications worldwide, yet their evaluation still centers on narrow, discriminative benchmarks. Such pipelines often overlook key generative aspects such as discourse coherence, linguistic transformations, and adequacy, which are crucial for real-world applications. In addition, most large-scale evaluations remain heavily biased toward English, limiting our understanding of LLM performance in other languages. This research addresses these gaps by presenting a comprehensive analysis of Brazilian Portuguese LLMs across three core Natural Language Generation tasks: summarization, simplification, and generative question answering. We evaluate six Brazilian models and compare them to the widely used GPT-4o. Our findings, supported by diverse automatic metrics, an LLM-as-a-judge framework, and human evaluation, show that the GPT-4o series achieves the best generative performance in Portuguese, followed closely by the Sabiá-3 family. While slightly behind, the open-weight model Tucano stands out for its computational efficiency, making it a strong candidate for deployment in resource-constrained settings. The code used to conduct all experiments is publicly available at https://github.com/MeLLL-UFF/brfauna-gen-eval.