# ICLR Submission Code: DGSM-SCAM-GAT and MMT-ViT

This repository contains the implementation of the **DGSM-SCAM-GAT** and **MMT-ViT** models for malware classification, as described in our ICLR submission. The code lives in the `ProgectPytorch` directory, and each file name indicates the model it belongs to (e.g., `dgsm-scam-gat_model.py` for DGSM-SCAM-GAT and `mmt-ViT_multimodal_model.py` for MMT-ViT). Each model's definition, training, and testing logic is contained in a single Python script.

## Overview
- **DGSM-SCAM-GAT**: A graph-based model combining DGSM, SCAM, and GAT for malware classification. It is trained and evaluated on the `dynamic_api_call_sequence` dataset and fine-tuned on the `mal_api_2019` dataset.
- **MMT-ViT**: A multimodal Vision Transformer-based model that processes grayscale images, wavelet sequences, and instruction sequences. It is trained on the `big2015` dataset (9 classes) and fine-tuned on the `malimg` dataset (25 classes) and `Malevis_malimg` dataset (31 classes).
- **Tasks**:
  - DGSM-SCAM-GAT: Classification on `dynamic_api_call_sequence`, with validation and fine-tuning on `mal_api_2019`.
  - MMT-ViT: Classification on `big2015` (9 classes) and fine-tuning on the `malimg` dataset (25 classes) and the `Malevis_malimg` dataset (31 classes).

## Directory Structure

ProgectPytorch/
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── environment.yml               # Environment specification
├── LICENSE                       # License file
├── dgsm-scam-gat_model.py       # DGSM-SCAM-GAT model definition, training, and testing on dynamic_api_call_sequence
├── dgsm-scam-gat_model_yz.py    # DGSM-SCAM-GAT model validation on mal_api_2019
├── dgsm-scam-gat_model_wt.py    # DGSM-SCAM-GAT model fine-tuning validation on mal_api_2019
├── dgsm-scam-gat_yz_api_data_2019.py    # DGSM-SCAM-GAT model validation data preprocessing
├── mmt-ViT_multimodal_model.py          # MMT-ViT model definition, training, and testing (big2015)
├── mmt-ViT_data_preprocessing.py        # Data preprocessing for MMT-ViT (big2015)
├── mmt-ViT_multimodal_wt_yz_25.py       # MMT-ViT fine-tuning for malimg (25 classes)
├── mmt-ViT_multimodal_wt_yz_31.py       # MMT-ViT fine-tuning for Malevis_malimg (31 classes)
├── mmt-ViT_gray_only.py                 # MMT-ViT variant that keeps only the grayscale-image branch, for final model validation
├── config.py                 # Configuration file (hyperparameters, paths)
├── data/                     # Dataset directory (symlinks or placeholders)
│   ├── dynamic_api_call_data/  # dynamic_api_call_sequence dataset
│   ├── mal_api_2019/           # mal_api_2019 dataset (API and DLL sequence data, malicious and benign samples), including the preprocessed data
│   ├── big2015/dataset_big2015/    # big2015 dataset (.bytes, .asm, labels)
│   └── big2015_yz/
│       ├── malimg_25/           # malimg dataset (25 classes)
│       └── Malevis_malimg_31/   # Malevis_malimg dataset (31 classes)
├── results/                  # Training results and logs
│   ├── dgsm_scam_gat/        # DGSM-SCAM-GAT results
│   └── mmt_vit/              # MMT-ViT results (big2015, malimg 25 classes, Malevis_malimg 31 classes)
├── scripts/                      # Scripts to run experiments
│   ├── preprocess_data.sh        # Data preprocessing script
│   ├── run_dgsm_scam_gat.sh      # DGSM-SCAM-GAT execution script
│   ├── run_dgsm_scam_gat_fine-tuning.sh      # DGSM-SCAM-GAT validation and fine-tuning script
│   ├── run_mmt_vit.sh            # MMT-ViT execution script (big2015)
│   └── run_mmt_vit_finetune.sh   # MMT-ViT fine-tuning script (malimg, Malevis_malimg)
└── docs/                         # Additional documentation
    ├── dataset_description.md     # Dataset details
    └── model_results_description.md       # Model architecture details and instructions for reproducing results

## Prerequisites
- **Hardware**: GPU (NVIDIA RTX 4070 or higher recommended) with CUDA 11.8 support.
- **Software**:
  - Python 3.12 or higher
  - See `requirements.txt` for the complete dependency list, which includes `transformers`, `numpy`, `pandas`, `scikit-learn`, `opencv-python`, `pywt`, `matplotlib`, `seaborn`, and `torchmetrics`.

## Install Dependencies
  - Using pip: `pip install -r requirements.txt`
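
After installation, a quick check that the listed packages are importable can save a failed run later. The sketch below is not part of the repository; the function name `missing_modules` is hypothetical, and the module list only covers the packages named in this README (using their import names, which in some cases differ from the pip names).

```python
import importlib.util

# Import names for the packages this README lists. Some differ from the
# pip names: scikit-learn -> sklearn, opencv-python -> cv2.
REQUIRED_MODULES = [
    "transformers", "numpy", "pandas", "sklearn", "cv2",
    "pywt", "matplotlib", "seaborn", "torchmetrics",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only looks the modules up and does not import them, so it runs quickly even when the heavier packages are present.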

## Dataset
  - **dynamic_api_call_data**:
    - Description: Dataset containing dynamic API call sequences for malware classification.
    - Classes: 2
    - Download: https://www.kaggle.com/datasets/ang3loliveira/malware-analysis-datasets-api-call-sequences
    - Placement: Place in `ProgectPytorch/data/dynamic_api_call_data/`
    - Note: The preprocessed file is already provided in this folder.

  - **mal_api_2019**:
    - Description: API and DLL sequence data for malware classification, used for fine-tuning and validation of DGSM-SCAM-GAT.
    - Classes: 2. The benign samples from `dynamic_api_call_sequence` are combined with the malicious samples from `mal_api_2019` to form a two-class dataset.
    - Download: https://www.kaggle.com/datasets/focatak/malapi2019
    - Placement: Place in `ProgectPytorch/data/mal_api_2019/`
    - Note: The preprocessed data is already provided in this folder. To redo the preprocessing with `dgsm-scam-gat_yz_api_data_2019.py`, download the dataset, rename the dataset text file to `mal_api_2019.txt` and the labels text file to `mal_api_2019_lables.txt`, and place both in `ProgectPytorch/data/mal_api_2019/`. The file names must match exactly as written here. The `API_name_307.xlsx` file must also be placed in `ProgectPytorch/data/mal_api_2019/`.

  - **big2015**:
    - Description: Malware dataset with .bytes files, .asm files, and labels (big2015_Labels.csv).
    - Classes: 9
    - Download: https://www.kaggle.com/c/malware-classification
    - Placement: Place the .bytes and .asm files in `ProgectPytorch/data/big2015/dataset_big2015/`, and place `big2015_Labels.csv` in `ProgectPytorch/data/big2015/`.

  - **malimg**:
    - Description: Grayscale images for malware classification, organized into 25 classes.
    - Classes: 25
    - Download: https://www.kaggle.com/datasets/ikrambenabd/malimg-original
    - Placement: `ProgectPytorch/data/big2015_yz/malimg_25/data_in/`
    - Note: Place the folders named after malware types from the malimg_25 dataset into `ProgectPytorch/data/big2015_yz/malimg_25/data_in/`. Ensure the folder name `data_in` is correct.

  - **Malevis_malimg**:
    - Description: A blended dataset of RGB and grayscale byteplot images for malware classification, organized into 31 classes.
    - Classes: 31
    - Download: https://www.kaggle.com/datasets/gauravpendharkar/blended-malware-image-dataset
    - Placement: `ProgectPytorch/data/big2015_yz/Malevis_malimg_31/in/`
    - Note: Place the folders named after malware types from the downloaded dataset into `ProgectPytorch/data/big2015_yz/Malevis_malimg_31/in/`. Ensure the folder name `in` is correct.
    
**Note**: Due to size constraints, the datasets are not included in this package. Please download each dataset from the corresponding link above.
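
Since the scripts depend on the folder names given in the placement instructions, it may help to verify the layout before running anything. The following is a minimal sketch, not part of the repository: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and the paths are copied from the placement entries in this section, relative to `ProgectPytorch/`.

```python
from pathlib import Path

# Directories named in the placement instructions of this README,
# relative to the repository root (ProgectPytorch/).
EXPECTED_DIRS = [
    "data/dynamic_api_call_data",
    "data/mal_api_2019",
    "data/big2015/dataset_big2015",
    "data/big2015_yz/malimg_25/data_in",
    "data/big2015_yz/Malevis_malimg_31/in",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from inside ProgectPytorch/; prints nothing when the layout is complete.
    for rel in missing_dirs("."):
        print("missing:", rel)
```

The check only confirms that the directories exist; it does not validate their contents.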

## Running Experiments

**Note**: The MMT-ViT scripts `run_mmt_vit.sh` and `run_mmt_vit_finetune.sh` (which cover training, fine-tuning, and validation of the MMT-ViT model) use the pretrained model `google/vit-base-patch16-224`. Make sure a stable internet connection is available so that the pretrained model can be downloaded; otherwise, an error will occur.
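
One way to reduce the risk of a mid-run download failure is to fetch the checkpoint once, ahead of time. The sketch below is an assumption about how this could be done, not the repository's own code: `prefetch_vit` is a hypothetical helper, and it loads the checkpoint through `ViTModel`, whereas the training scripts may use a different `transformers` class.

```python
MODEL_ID = "google/vit-base-patch16-224"


def prefetch_vit(model_id=MODEL_ID):
    """Download the checkpoint once so later runs can load it from the local cache."""
    # Imported lazily so this file can be read without transformers installed.
    from transformers import ViTModel

    # from_pretrained fetches the weights on first use and caches them locally.
    return ViTModel.from_pretrained(model_id)
```

Calling `prefetch_vit()` once while the connection is good should leave the files in the local Hugging Face cache, where later `from_pretrained` calls for the same model ID can pick them up.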

 - **Data Preprocessing**
   - Run the preprocessing script to generate features for both models:
     `bash scripts/preprocess_data.sh`
   - This executes:
     - `ProgectPytorch/dgsm-scam-gat_yz_api_data_2019.py`: Processes the mal_api_2019 validation dataset for DGSM-SCAM-GAT validation and fine-tuning validation.
     - `ProgectPytorch/mmt-ViT_data_preprocessing.py`: Generates grayscale images, wavelet sequences, and instruction sequences from big2015 for MMT-ViT.
   - Note: `bash scripts/preprocess_data.sh` only needs to be executed once.

 - **DGSM-SCAM-GAT**
   - Train and evaluate the DGSM-SCAM-GAT model on the `dynamic_api_call_sequence` dataset:
     `bash scripts/run_dgsm_scam_gat.sh`
   - This runs `ProgectPytorch/dgsm-scam-gat_model.py`, which includes model definition, training, and testing.

 - **DGSM-SCAM-GAT Validation and Fine-Tuning**
   - Validate and fine-tune the DGSM-SCAM-GAT model on the mal_api_2019 dataset:
     `bash scripts/run_dgsm_scam_gat_fine-tuning.sh`
   - This executes `ProgectPytorch/dgsm-scam-gat_model_yz.py` to validate the model and `ProgectPytorch/dgsm-scam-gat_model_wt.py` to fine-tune it.

 - **MMT-ViT**
   - Train and evaluate the MMT-ViT model on the big2015 dataset:
     `bash scripts/run_mmt_vit.sh`
   - This runs `ProgectPytorch/mmt-ViT_multimodal_model.py`, which includes model definition, training, and testing.

 - **MMT-ViT Fine-Tuning**
   - Fine-tune the MMT-ViT model on the malimg dataset (25 classes) and the Malevis_malimg dataset (31 classes):
     `bash scripts/run_mmt_vit_finetune.sh`
   - This executes:
     - `ProgectPytorch/mmt-ViT_multimodal_wt_yz_25.py`: Fine-tunes on the malimg 25-class dataset.
     - `ProgectPytorch/mmt-ViT_multimodal_wt_yz_31.py`: Fine-tunes on the Malevis_malimg 31-class dataset.
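
The steps above can also be chained from a small Python driver. This is a sketch under assumptions rather than a script shipped with the repository: `PIPELINE`, `build_commands`, and `run_all` are hypothetical names, and the order (preprocessing, then each model's training followed by its fine-tuning) is taken from this README.

```python
import subprocess

# Script order taken from this README: preprocessing first, then each
# model's training run followed by its fine-tuning run.
PIPELINE = [
    "scripts/preprocess_data.sh",
    "scripts/run_dgsm_scam_gat.sh",
    "scripts/run_dgsm_scam_gat_fine-tuning.sh",
    "scripts/run_mmt_vit.sh",
    "scripts/run_mmt_vit_finetune.sh",
]


def build_commands(skip_preprocessing=False):
    """Return the bash commands to run, in order."""
    scripts = PIPELINE[1:] if skip_preprocessing else PIPELINE
    return [["bash", script] for script in scripts]


def run_all(skip_preprocessing=False, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(skip_preprocessing):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: lists the commands without executing them.
    run_all(dry_run=True)
```

Because preprocessing only needs to run once, later invocations can pass `skip_preprocessing=True`.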

## Results
- **Output Location**
  - DGSM-SCAM-GAT: Results saved in `ProgectPytorch/results/dgsm_scam_gat/`.
  - MMT-ViT (big2015 9-class): Results saved in `ProgectPytorch/results/mmt_vit/`.
  - MMT-ViT (malimg 25-class, fine-tuning validation): Results saved in `ProgectPytorch/results/mmt_vit/big2015_yz/yz_results_25`.
  - MMT-ViT (Malevis_malimg 31-class, fine-tuning validation): Results saved in `ProgectPytorch/results/mmt_vit/big2015_yz/yz_results_31`.
- **Metrics**
  - Includes accuracy, precision, recall, F1 score, and confusion matrices.
- **Reproducing Results**
  - To reproduce the results reported in the paper:
    - Download the datasets and place them under `ProgectPytorch/data/` as described above.
    - Run the preprocessing script (it only needs to be executed once): `bash scripts/preprocess_data.sh`
    - Execute the training and evaluation scripts:
      - DGSM-SCAM-GAT: `bash scripts/run_dgsm_scam_gat.sh`
      - DGSM-SCAM-GAT (validation and fine-tuning): `bash scripts/run_dgsm_scam_gat_fine-tuning.sh`
      - MMT-ViT (big2015): `bash scripts/run_mmt_vit.sh`
      - MMT-ViT (fine-tuning): `bash scripts/run_mmt_vit_finetune.sh`, which runs `mmt-ViT_multimodal_wt_yz_25.py` and `mmt-ViT_multimodal_wt_yz_31.py`
    - Check the results directories listed above for metrics and visualizations.
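
As a reference for how the reported metrics relate to one another, here is a minimal pure-Python sketch that derives accuracy, precision, recall, and F1 from a confusion matrix. The function names are hypothetical and macro averaging is an assumption; the repository's scripts may compute these with a library such as `torchmetrics` or `scikit-learn` and may use a different averaging mode.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Build a matrix whose rows are true classes and columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_metrics(y_true, y_pred, num_classes):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    cm = confusion_matrix(y_true, y_pred, num_classes)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(num_classes))
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(num_classes))  # column sum
        actual = sum(cm[c])                                    # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }
```

For example, with true labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, the confusion matrix is `[[1, 1], [0, 2]]`, accuracy is 0.75, macro recall is 0.75, macro precision is 5/6, and macro F1 is 11/15.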
