# Dataset Description

This document describes the datasets used in the ICLR submission for the DGSM-SCAM-GAT and MMT-ViT models. Because of size constraints, the datasets are not included in the submission package. Download them from the links below and place them in the specified directories.

## 1. dynamic_api_call_sequence
- **Description**: Contains dynamic API call sequences for malware classification, stored as CSV files with API call sequences and labels.
- **Classes**: 2 (malware vs. benign)
- **Format**: CSV files, each containing a sequence of API calls (e.g., `api_call` column) and a label derived from the filename or metadata.
- **Download**: https://www.kaggle.com/datasets/ang3loliveira/malware-analysis-datasets-api-call-sequences
- **Placement**: `ProgectPytorch/data/dynamic_api_call_data/`
- **Note**: The preprocessed file is already in this folder.

## 2. mal_api_2019
- **Description**: API and DLL sequence data for malware classification, used for fine-tuning and validation of DGSM-SCAM-GAT.
- **Classes**: 2 (malware vs. benign). The two-class dataset is formed by combining the benign samples from `dynamic_api_call_sequence` with the malicious samples from `mal_api_2019`.
- **Format**: CSV files with API and DLL call sequences, similar to `dynamic_api_call_sequence`.
- **Download**: https://www.kaggle.com/datasets/focatak/malapi2019
- **Placement**: `ProgectPytorch/data/mal_api_2019/`
- **Note**: The preprocessed file is already in this folder.
  To preprocess the data yourself with `dgsm-scam-gat_yz_api_data_2019.py`, download the dataset, then rename the dataset text file to `mal_api_2019.txt` and the labels text file to `mal_api_2019_lables.txt`.
  Place both files in `ProgectPytorch/data/mal_api_2019/`. The file names must match exactly.
  Also place `API_name_307.xlsx` in `ProgectPytorch/data/mal_api_2019/`.
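
The combination described under **Classes** above can be sketched as follows. This is a minimal illustration only, not the repository's preprocessing code: the function name `build_two_class_dataset`, the label encoding (0 = benign, 1 = malware), and the in-memory sample layout are all assumptions made for the example.

```python
from typing import Iterable, List, Sequence, Tuple

# Assumed label encoding; the repository may use a different one.
BENIGN, MALWARE = 0, 1


def build_two_class_dataset(
    dynamic_samples: Iterable[Tuple[Sequence[str], int]],
    mal_api_sequences: Iterable[Sequence[str]],
) -> List[Tuple[List[str], int]]:
    """Keep only benign rows from the first source and label every
    sequence from the second source as malware."""
    combined = [
        (list(seq), BENIGN)
        for seq, label in dynamic_samples
        if label == BENIGN
    ]
    combined += [(list(seq), MALWARE) for seq in mal_api_sequences]
    return combined


if __name__ == "__main__":
    # Toy sequences standing in for real API call traces.
    dynamic = [(["call_a", "call_b"], BENIGN), (["call_c"], MALWARE)]
    mal_api = [["call_d", "call_e"]]
    for seq, label in build_two_class_dataset(dynamic, mal_api):
        print(label, seq)
```

Malicious rows from `dynamic_api_call_sequence` are dropped in this sketch, so the malware class comes only from `mal_api_2019`, matching the description above.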

## 3. big2015
- **Description**: Malware dataset containing `.bytes` and `.asm` files, with labels in `big2015_Labels.csv`.
- **Classes**: 9
- **Format**: `.bytes` files (byte sequences), `.asm` files (assembly code), and a CSV file (`big2015_Labels.csv`) with `Id` and `Class` columns.
- **Download**: https://www.kaggle.com/c/malware-classification
- **Placement**: Place the `.bytes` and `.asm` files in `ProgectPytorch/data/big2015/dataset_big2015/`, and place `big2015_Labels.csv` in `ProgectPytorch/data/big2015/`.
- **Preprocessing**: Run `ProgectPytorch/mmt-ViT_data_preprocessing.py` to generate grayscale images, wavelet sequences, and instruction sequences (via `bash scripts/preprocess_data.sh`).
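
Before running the preprocessing, it may help to confirm that every labelled sample has both of its files in place. The sketch below is one way to do that; it is not part of the repository. It assumes each sample is stored as `<Id>.bytes` and `<Id>.asm`, and the helper names `load_labels` and `find_incomplete_samples` are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_labels(labels_csv: Path) -> Dict[str, int]:
    """Read the labels CSV (columns `Id` and `Class`) into {Id: Class}."""
    with open(labels_csv, newline="") as handle:
        return {row["Id"]: int(row["Class"]) for row in csv.DictReader(handle)}


def find_incomplete_samples(labels: Dict[str, int], sample_dir: Path) -> List[str]:
    """Return the Ids that lack a .bytes or .asm file in sample_dir.

    Assumes files are named <Id>.bytes and <Id>.asm.
    """
    return sorted(
        sample_id
        for sample_id in labels
        if not (sample_dir / f"{sample_id}.bytes").is_file()
        or not (sample_dir / f"{sample_id}.asm").is_file()
    )


if __name__ == "__main__":
    root = Path("ProgectPytorch/data/big2015")
    labels_csv = root / "big2015_Labels.csv"
    if labels_csv.is_file():
        labels = load_labels(labels_csv)
        missing = find_incomplete_samples(labels, root / "dataset_big2015")
        print(f"{len(labels)} labelled samples, {len(missing)} incomplete")
    else:
        print(f"{labels_csv} not found; download the dataset first")
```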

## 4. malimg
- **Description**: Grayscale images for malware classification, organized into 25 classes.
- **Classes**: 25
- **Format**: Grayscale PNG/JPEG images, each class in a separate subdirectory.
- **Download**: https://www.kaggle.com/datasets/ikrambenabd/malimg-original
- **Placement**: `ProgectPytorch/data/big2015_yz/malimg_25/data_in/`
- **Note**: Place the folders named after malware types from the malimg_25 dataset into `ProgectPytorch/data/big2015_yz/malimg_25/data_in/`. The folder must be named `data_in`.

## 5. Malevis_malimg
- **Description**: Combined dataset of RGB and grayscale byteplot images for malware classification, organized into 31 classes.
- **Classes**: 31
- **Format**: RGB and grayscale PNG/JPEG images, each class in a separate subdirectory.
- **Download**: https://www.kaggle.com/datasets/gauravpendharkar/blended-malware-image-dataset
- **Placement**: `ProgectPytorch/data/big2015_yz/Malevis_malimg_31/in/`
- **Note**: Place the folders named after malware types from the downloaded dataset into `ProgectPytorch/data/big2015_yz/Malevis_malimg_31/in/`. The folder must be named `in`.
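
Both image datasets (`malimg` and `Malevis_malimg`) use one subdirectory per class, so a quick way to verify the placement is to count the class folders and the images inside each. The following is a minimal sketch under that assumption; `count_images_per_class` is a hypothetical helper, not a function from the repository, and it only recognises the `.png`, `.jpg`, and `.jpeg` extensions.

```python
from pathlib import Path
from typing import Dict

# Extensions treated as images in this sketch (an assumption).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_class(root: Path) -> Dict[str, int]:
    """Map each class subdirectory of root to the number of image
    files stored directly inside it."""
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    # Placement directories and class counts as listed in this document.
    expected = {
        Path("ProgectPytorch/data/big2015_yz/malimg_25/data_in"): 25,
        Path("ProgectPytorch/data/big2015_yz/Malevis_malimg_31/in"): 31,
    }
    for root, n_classes in expected.items():
        if root.is_dir():
            found = len(count_images_per_class(root))
            print(f"{root}: {found} class folders (expected {n_classes})")
        else:
            print(f"{root}: directory not found")
```

A class-folder count other than 25 or 31 usually means an extra nesting level was kept when the archive was extracted.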

## Notes
- Ensure datasets are placed in the correct directories as specified above.
- Run `bash scripts/preprocess_data.sh` to preprocess all datasets before training.
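
As a convenience, the placement directories listed in this document can be checked in one pass before preprocessing. This is only a sketch: the helper `unpopulated_dirs` is hypothetical and it assumes the script is run from the directory that contains `ProgectPytorch/`.

```python
from pathlib import Path
from typing import List, Sequence

# Placement directories copied from the sections above.
EXPECTED_DIRS = [
    "ProgectPytorch/data/dynamic_api_call_data",
    "ProgectPytorch/data/mal_api_2019",
    "ProgectPytorch/data/big2015/dataset_big2015",
    "ProgectPytorch/data/big2015_yz/malimg_25/data_in",
    "ProgectPytorch/data/big2015_yz/Malevis_malimg_31/in",
]


def unpopulated_dirs(
    repo_root: Path, expected: Sequence[str] = EXPECTED_DIRS
) -> List[str]:
    """Return the expected directories that are missing or empty."""
    problems = []
    for rel in expected:
        path = repo_root / rel
        if not path.is_dir() or not any(path.iterdir()):
            problems.append(rel)
    return problems


if __name__ == "__main__":
    problems = unpopulated_dirs(Path("."))
    if problems:
        print("Missing or empty dataset directories:")
        for rel in problems:
            print(f"  {rel}")
    else:
        print("All dataset directories are populated.")
```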
