## Data preprocessing
Before preprocessing the data, make sure that Joern is installed. This work uses Joern 4.0.365.
Run the scripts below in the order listed, replacing the example parameter values with your own.

---
### parse.py
This Python script processes source code functions in different programming languages (C/C++, Java, Python) to extract their code property graphs (CPG) using the Joern toolchain. It reads a JSONL input file containing multiple functions, generates intermediate graph representations, filters and processes nodes and edges, and finally outputs a processed JSONL file suitable for downstream tasks such as code analysis or vulnerability detection.
#### Usage
   The input file should be in JSONL format (one JSON object per line), with each object containing at least:
   * "idx": unique function ID or index
   * "func": the source code string of the function
   * "target": Whether the target has a defect, with 1 indicating defect presence
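
As a minimal sketch of this input layout, the snippet below writes and reads a small JSONL file, assuming one JSON object per line as the script description states. The two sample functions and the helper names `write_jsonl` and `read_jsonl` are hypothetical and are not part of `parse.py`.

```python
import json

# Hypothetical samples; only the three field names come from the list above.
samples = [
    {"idx": 0, "func": "int add(int a, int b) { return a + b; }", "target": 0},
    {"idx": 1, "func": "void copy(char *d, char *s) { strcpy(d, s); }", "target": 1},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `write_jsonl("test_data.jsonl", samples)` would produce a file that can be passed to `--data_file`.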
  
Run the script from the command line, e.g.:
   ``` console
   $ python parse.py --language cpp --data_file test_data.jsonl --output_dir test_cpg --save_dir test_save --output_file output.jsonl --max_workers 8 --node_num_threhold 8
   ```
#### Arguments
* --language: Language type (cpp, java, or python).
* --data_file: Path to input JSONL file.
* --output_dir: Directory to save intermediate temporary files. 
* --save_dir: Directory to save final outputs. 
* --output_file: Path to write the final processed JSONL results. 
* --max_workers: Number of concurrent threads for parallel processing (default 10). 
* --node_num_threhold: Threshold for node number filtering (default 8).
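
To illustrate how `--max_workers` and `--node_num_threhold` might interact, here is a rough sketch using a thread pool. It is not the code in `parse.py`: `extract_graph` is a hypothetical stand-in for the Joern-based extraction, the `nodes` field name is assumed, and keeping graphs with at least the threshold number of nodes is an assumption about the filter direction.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_graph(sample):
    """Hypothetical stand-in for the Joern extraction step.

    Each whitespace-separated token of the function is treated as one "node".
    """
    return {
        "idx": sample["idx"],
        "nodes": sample["func"].split(),
        "target": sample["target"],
    }


def process_all(samples, max_workers=10, node_num_threshold=8):
    """Extract graphs concurrently, then drop graphs that are too small."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input samples.
        graphs = list(pool.map(extract_graph, samples))
    # Assumption: a graph is kept when it has at least `node_num_threshold` nodes.
    return [g for g in graphs if len(g["nodes"]) >= node_num_threshold]
```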

#### Output
   
   A JSONL file in which each line is one processed function sample, including its nodes, edges, and labels, among other fields.

### check_data.py
This Python script analyzes the distribution of node counts from a JSONL dataset containing function samples. It calculates basic statistics such as minimum, maximum, average, median, and standard deviation of node counts, counts samples by target labels, and plots a histogram showing the distribution of node counts.
#### Usage
Run the script from the command line with the required input file and optional output name prefix:
 ``` console
python check_data.py --input path/to/input.jsonl --output node_distribution
```
#### Arguments
* --input (required): Path to the input JSONL file containing the dataset.
* --output (optional): Prefix name for the output histogram image file (default: node_distribution).

#### Output
* Prints dataset size, counts of samples with target labels 0 and 1, minimum and maximum node counts and the IDs of corresponding samples.
* Prints statistics including maximum, average, median, and standard deviation of node counts.
* Saves a histogram PNG file (e.g., node_distribution.png) showing the distribution of node counts grouped in intervals of 100.
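
The statistics and the 100-wide grouping described above can be sketched with the standard library alone. This is an illustration rather than the script's code: the function name `node_count_stats` is hypothetical, plotting is left out, and the use of the population standard deviation is an assumption.

```python
import statistics
from collections import Counter


def node_count_stats(node_counts, bin_width=100):
    """Summarize node counts and group them into fixed-width intervals."""
    stats = {
        "min": min(node_counts),
        "max": max(node_counts),
        "mean": statistics.mean(node_counts),
        "median": statistics.median(node_counts),
        # Assumption: population standard deviation; the script may use the sample one.
        "stdev": statistics.pstdev(node_counts),
    }
    # Key each bin by its lower bound, e.g. 0 covers 0-99 and 100 covers 100-199.
    bins = Counter((n // bin_width) * bin_width for n in node_counts)
    return stats, dict(sorted(bins.items()))
```

The returned bin dictionary holds the per-interval sample counts that a histogram would display.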

### check_repeat.py
This script removes adjacent duplicate `nodes_codes` entries from a JSONL dataset of function samples. When two consecutive records have identical `nodes_codes`, one of them is randomly deleted (unless one was already marked for deletion, in which case the later one is removed). All deletions are recorded with context in a separate log file. Summary statistics are printed at the end.
#### Usage
Run the script from the command line with the input, output, and log file paths:
 ``` console
 python check_repeat.py --input path/to/input.jsonl --output path/to/cleaned_output.jsonl --log path/to/deletion_log.jsonl
 ```
#### Arguments
* --input: Path to the input JSONL file.
* --output: Path where the cleaned JSONL output will be saved.
* --log: Path where the deletion record log (JSONL) will be saved.

#### Output
* Cleaned data JSONL: original dataset with selected adjacent duplicates removed.
* Deletion log JSONL: one record per deletion, includes fields like index, id, filename, action, reason, target (if available), and timestamp.
* Console summary: original row count, cleaned row count, deleted rows, and file paths.
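
The adjacent-duplicate rule described for this script can be sketched as follows. This is one possible reading of that rule, not the script's actual code: the function name, the seeded random generator, and the reduced set of log fields are assumptions made for illustration.

```python
import random


def drop_adjacent_duplicates(records, seed=0):
    """Remove one record from each pair of consecutive records with equal nodes_codes."""
    rng = random.Random(seed)  # seeded only to make this sketch reproducible
    deleted = set()
    log = []
    for i in range(len(records) - 1):
        if records[i]["nodes_codes"] != records[i + 1]["nodes_codes"]:
            continue
        if i in deleted:
            # The earlier record is already marked, so remove the later one.
            victim = i + 1
        else:
            victim = rng.choice((i, i + 1))
        deleted.add(victim)
        log.append({"index": victim, "reason": "adjacent duplicate nodes_codes"})
    cleaned = [r for k, r in enumerate(records) if k not in deleted]
    return cleaned, log
```

Under this reading, a run of identical consecutive records always collapses to a single surviving record.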

### process_again.py
This Python script automates the process of parsing source code functions into Code Property Graphs (CPG) using Joern, exporting Abstract Syntax Trees (ASTs) in DOT format, extracting graph representations, and handling reprocessing of specific samples. It supports concurrent processing with thread-safe logging and filtering of nodes and edges to produce cleaned graph data suitable for vulnerability detection or other code analysis tasks.
#### Usage
Run the script from the command line with the following arguments:
 ``` console
python process_again.py --language cpp --data_file path/to/input.jsonl --output_dir path/to/output_cpg_dir --save_dir path/to/save_dir --process_file path/to/original_processed.jsonl --process_file2 path/to/reprocessed_output.jsonl --output_file final_output.jsonl
```
#### Arguments
* --language: Language type (cpp, java, or python).
* --data_file: Path to input JSONL data file.
* --output_dir: Directory to save intermediate outputs like CPG binaries and DOT files.
* --save_dir: Directory to save the final processed JSONL output files.
* --process_file: Path to the originally processed JSONL file for reprocessing.
* --process_file2: Path to save the updated JSONL file after reprocessing.
* --output_file: Filename for the final reprocessed JSONL output (saved under save_dir).
* --max_workers: Number of threads to use for concurrent processing (default: 10).
* --node_num_threhold: Minimum node count threshold to keep a processed graph (default: 8).
#### Output
* Intermediate files (binary CPG and DOT AST files) are saved under the specified --output_dir.
* The final processed graph data for each function is saved as a JSONL file at --save_dir/--output_file.
* When reprocessing is performed, the updated JSONL file is saved to --process_file2.
### process_NS_label.py
This script adds NATURE_SEQUENCE edges and corresponding labels to graph-structured data stored in JSON Lines (.jsonl) files. Each data item represents a graph with nodes, edges, and edge labels. The NATURE_SEQUENCE edges connect nodes on consecutive lines of source code to enrich the graph representation.
#### Usage
Run the script from the command line:
 ``` console
python process_NS_label.py --input_files input1.jsonl input2.jsonl --output_files output1.jsonl output2.jsonl
 ```
#### Arguments
* --input_files: One or more input JSONL file paths containing graph data.
* --output_files: Corresponding output JSONL file paths where processed graphs will be saved. Must match the number of input files.
#### Output
Each output file will contain the same data as the input, but with additional edges labeled NATURE_SEQUENCE that connect nodes on consecutive code lines, enriching the graph structure for downstream analysis or model training.
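
A minimal sketch of adding such edges is shown below. The graph layout is assumed for illustration and may differ from the real files: `node_lines` (source line of each node), `edges` (pairs of node indices), and `edge_labels` (parallel to `edges`) are hypothetical field names, and choosing the first node on each line as that line's representative is also an assumption.

```python
def add_nature_sequence_edges(graph):
    """Return a copy of the graph with NATURE_SEQUENCE edges between consecutive lines."""
    # Assumption: the first node seen on a line represents that line.
    first_node_on_line = {}
    for node_id, line in enumerate(graph["node_lines"]):
        first_node_on_line.setdefault(line, node_id)

    edges = list(graph["edges"])
    labels = list(graph["edge_labels"])
    for line in sorted(first_node_on_line):
        next_line = line + 1
        if next_line in first_node_on_line:
            edges.append([first_node_on_line[line], first_node_on_line[next_line]])
            labels.append("NATURE_SEQUENCE")
    return {**graph, "edges": edges, "edge_labels": labels}
```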

### clean_graph.py
This script cleans and renumbers graph data from JSONL datasets by removing empty nodes or nodes containing only a single word. It also removes edges connected to those filtered nodes, ensuring the resulting graph is compact and valid. The processed graphs are saved as new JSONL files for further use.
#### Usage
Update the input_files and output_files lists in the script to match your file paths.

Run the script:
 ``` console
python clean_graph.py
```
#### Output
* Each output file contains the processed graph data with filtered nodes and updated edges.

* A summary is printed after processing each file: the total number of original samples, the number of samples remaining after cleaning, and the count of removed empty graphs.
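
The filtering and renumbering step can be sketched as below. This is an illustration under assumptions, not the script itself: nodes are taken to be a list of code strings, edges a list of index pairs, and a "word" is taken to mean a whitespace-separated token.

```python
def clean_graph(nodes, edges):
    """Drop empty or single-word nodes, renumber the rest, and remap the edges."""
    old_to_new = {}
    kept_nodes = []
    for old_id, code in enumerate(nodes):
        # Assumption: a node is kept only if it has at least two whitespace-separated words.
        if len(code.split()) >= 2:
            old_to_new[old_id] = len(kept_nodes)
            kept_nodes.append(code)
    # Keep an edge only when both of its endpoints survived the filter.
    kept_edges = [
        [old_to_new[src], old_to_new[dst]]
        for src, dst in edges
        if src in old_to_new and dst in old_to_new
    ]
    return kept_nodes, kept_edges
```

A graph whose node list comes back empty would correspond to the "removed empty graphs" counted in the summary.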