Data Augmentation with GPT-3.5 for Vietnamese Natural Language Inference

Hieu-Hien Mai, Ngoc Hoang Luong

Published: 2023, Last Modified: 06 Jan 2026. Venue: RIVF 2023. License: CC BY-SA 4.0

Abstract: Data augmentation is a widely used technique in natural language processing (NLP) for improving performance and out-of-domain generalization. Existing work on data augmentation for Vietnamese NLP tasks typically modifies only one or a few words (tokens) in each original sentence of an existing dataset, which limits the diversity of the augmented data. We investigate a recently introduced data augmentation methodology in which a pretrained large language model (LLM), here OpenAI GPT-3.5 Turbo, is used both to generate new data and to filter for high-quality data for final use. We focus on a natural language inference (NLI) task for the Vietnamese language with four labels: “entailment”, “contradiction”, “neutral”, and “other”. Instead of replacing or deleting a few words in each sentence, as most conventional approaches do, our pipeline exploits the LLM's ability to rewrite sentences anew, following a prompt for each label definition. Experimental results indicate that our augmented data can improve the accuracy of Vietnamese classifiers on the NLI task and yield better out-of-domain generalization.
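
The generate-then-filter idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the label definitions, the prompt wording, the function names, and the filtering rule (keep a generated pair only if the model, asked again, assigns it the intended label) are all assumptions, since the abstract does not specify them. The `llm` argument is a hypothetical callable that maps a prompt string to a response string; in practice it might wrap a call to GPT-3.5 Turbo.

```python
from typing import Callable, Dict, List

# Placeholder label definitions (assumed wording, not taken from the paper).
LABEL_DEFINITIONS: Dict[str, str] = {
    "entailment": "the hypothesis must be true if the premise is true",
    "contradiction": "the hypothesis must be false if the premise is true",
    "neutral": "the hypothesis may be true or false given the premise",
    "other": "the hypothesis does not fit any of the three labels above",
}


def build_generation_prompt(premise: str, label: str) -> str:
    """Ask the model to write a new hypothesis that matches one label definition."""
    return (
        f"Premise (Vietnamese): {premise}\n"
        f"Write a new Vietnamese hypothesis such that {LABEL_DEFINITIONS[label]}."
    )


def build_filter_prompt(premise: str, hypothesis: str) -> str:
    """Ask the model to classify a premise-hypothesis pair (assumed filter step)."""
    labels = ", ".join(LABEL_DEFINITIONS)
    return (
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        f"Answer with exactly one label from: {labels}."
    )


def augment(premises: List[str], llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Generate one hypothesis per (premise, label) and keep label-consistent pairs."""
    kept: List[Dict[str, str]] = []
    for premise in premises:
        for label in LABEL_DEFINITIONS:
            hypothesis = llm(build_generation_prompt(premise, label)).strip()
            predicted = llm(build_filter_prompt(premise, hypothesis)).strip().lower()
            # Assumed quality filter: discard empty outputs and label mismatches.
            if hypothesis and predicted == label:
                kept.append(
                    {"premise": premise, "hypothesis": hypothesis, "label": label}
                )
    return kept
```

Passing the model in as a plain callable keeps the sketch independent of any particular client library, so the same loop can be exercised with a stub function before being connected to a real model.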

External IDs: dblp:conf/rivf/MaiL23