CLEAR: Command Level Annotated Dataset for Ransomware Detection

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster)
License: CC BY-NC-SA 4.0
Keywords: Dataset, Benchmark, Sequential Data, Cybersecurity, Ransomware, Transformers, LSTM
TL;DR: We introduce the Command Level Annotated Ransomware (CLEAR) dataset, a large-scale, command-labeled collection of storage devices’ stream data, and show its effectiveness by training models that outperform the current state of the art.
Abstract: Over the last decade, ransomware detection has become a central topic in cybersecurity research. Because ransomware interacts directly with storage devices, analyzing I/O streams has become an effective detection method and a vital area of research. A major challenge in this field is the lack of publicly accessible data with labels on individual commands. To address this problem, we introduce the Command LEvel Annotated Ransomware (CLEAR) dataset, a large-scale collection of storage devices' stream data. The dataset comprises 1,045 TiB of I/O traffic data, featuring malicious traffic from 137 ransomware variants. It offers two orders of magnitude more I/O traffic data and one order of magnitude more ransomware variants than any other publicly accessible dataset. Importantly, it is the only dataset that individually labels each I/O command as either ransomware or benign activity. This labeling enables the use of advanced sequential models, which we show to outperform existing state-of-the-art models by up to 82% in data loss prevention. It also enables new tasks, such as data recovery, in which only the commands recognized as ransomware are reverted while benign activity is preserved. The CLEAR dataset also includes auxiliary features derived from the data, which we show to improve performance through feature ablation studies. Lastly, because new strains constantly emerge, a critical property of any ransomware detection model is its robustness to unseen ransomware variants. We therefore propose a benchmark based on our dataset to evaluate performance against unknown ransomware samples and illustrate its application across different models.
Croissant File: json
Dataset URL: http://kaggle.com/datasets/johndoenvme/clear-command-level-annotated-ransomware
Code URL: http://github.com/ravensorioles/rwdetection
Primary Area: Other
Submission Number: 1973
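
The data-recovery task mentioned in the abstract (reverting only the commands recognized as ransomware while keeping benign activity) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the CLEAR dataset or the authors' code: the `WriteCommand` class, its field names (`lba`, `data`, `is_ransomware`), and the `recover` function are hypothetical, and the replay-from-snapshot strategy is just one simple way such a recovery could be modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteCommand:
    """Hypothetical record for one write command (not the CLEAR schema)."""
    lba: int             # logical block address being written
    data: bytes          # payload written to that block
    is_ransomware: bool  # per-command label, or a model's prediction


def recover(snapshot: dict[int, bytes], log: list[WriteCommand]) -> dict[int, bytes]:
    """Rebuild device state from a snapshot, replaying only benign writes.

    Writes flagged as ransomware are skipped, so the blocks they touched
    keep their last benign (or snapshot) contents.
    """
    state = dict(snapshot)  # copy, so the snapshot itself is not modified
    for cmd in log:
        if not cmd.is_ransomware:
            state[cmd.lba] = cmd.data
    return state


# Toy example: two blocks exist before the attack.
snapshot = {0: b"report", 1: b"photo"}
log = [
    WriteCommand(0, b"\x8f\x13\xa0", True),  # flagged: overwrites block 0
    WriteCommand(2, b"notes", False),        # benign: writes a new block
    WriteCommand(1, b"\x00\xff\x10", True),  # flagged: overwrites block 1
]
recovered = recover(snapshot, log)
print(recovered)
```

In this sketch, recovery quality depends entirely on the per-command flags: a flagged benign write is lost, and an unflagged ransomware write survives, which is why command-level labels matter for this kind of task.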