DISC: A Dataset for Information Security Classification

Elijah Bass, Massimiliano Albanese, Marcos Zampieri

Published: 2024, Last Modified: 08 Apr 2025, SECRYPT 2024, CC BY-SA 4.0

Abstract: Research in information security classification has traditionally relied on carefully curated datasets. However, the sensitive nature of the classified information contained in such datasets poses challenges in terms of accessibility and reproducibility. Existing data sources often lack openly available resources for automated data collection and quality review processes, making reproducible research difficult. Additionally, datasets constructed from declassified information, though valuable, are not readily available to the public, and their creation methods remain poorly documented, rendering them non-reproducible. This paper addresses these challenges by introducing DISC, a dataset and framework, driven by artificial intelligence principles, for information security classification. The framework aims to streamline all the stages of dataset creation, from preprocessing of raw documents to annotation. By enabling reproducibility and augmentation, this approach enhan