Refining Sensitive Document Classification: Introducing an Enhanced Dataset Proposal

Published: 2024, Last Modified: 24 Jan 2026, SMC 2024, CC BY-SA 4.0
Abstract: The need for document exchange between individuals, companies, and governments grows every day. Consequently, safeguarding documents against potential attackers becomes increasingly crucial. Several attacks have been reported over the past years, and the risk of document leaks is higher than ever. To prevent data breaches, we need tools that determine the sensitivity degree of documents, allowing us to guarantee that only authorized people have access to them and to adapt protection strategies to each sensitivity level. To this end, deep learning techniques have shown good performance in document classification and therefore in sensitivity identification. Such approaches require sufficiently large resources to learn robust models. However, due to the sensitive nature of the documents involved, public datasets for conducting research in this context are lacking. In this paper, we experiment with Large Language Models (LLMs) to generate a multi-domain dataset of business documents in both English and French. Using a two-step generation process, we employ several prompting strategies across six language models to create an initial dataset of documents classified into 4 sensitivity classes: Public, Internal, Confidential, and Restricted. We then relied on human experts to review and validate the generated annotations on a sample of documents. The generated dataset was evaluated against two strong baselines.
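The two-step process described above (generate a document for a target sensitivity class, then annotate it) can be sketched as follows. This is an illustrative sketch only: the prompt wording, function names, and the `fake_llm` stub are assumptions, not the paper's actual pipeline, and a real setup would replace the stub with calls to the six language models.

```python
# Hypothetical sketch of a two-step generate-then-annotate pipeline.
# All names and prompt templates are illustrative assumptions.

SENSITIVITY_CLASSES = ["Public", "Internal", "Confidential", "Restricted"]

def build_generation_prompt(domain: str, language: str, label: str) -> str:
    """Step 1: ask a model to write a business document for a target class."""
    assert label in SENSITIVITY_CLASSES
    return (
        f"Write a {language} business document from the {domain} domain "
        f"whose content is typical of the '{label}' sensitivity level."
    )

def build_annotation_prompt(document: str) -> str:
    """Step 2: ask a (possibly different) model to label the document."""
    options = ", ".join(SENSITIVITY_CLASSES)
    return f"Classify this document as one of: {options}.\n\n{document}"

def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call; returns a canned, valid label.
    return "Internal"

# Example round trip for one synthetic document.
gen_prompt = build_generation_prompt("finance", "French", "Confidential")
document = fake_llm(gen_prompt)          # would be the generated text
label = fake_llm(build_annotation_prompt(document))
```

In such a setup, disagreement between the class requested in step 1 and the label returned in step 2 is one natural signal for routing a document to the human expert review the abstract mentions.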