
# HashMark: Watermarking Tabular/Synthetic Data for Machine Learning via Cryptographic Hash Functions

This project provides a framework for generating synthetic datasets, applying watermarking techniques, and evaluating the impact on machine learning model performance.

## Overview

This script performs the following:
1. Loads real-world datasets (`Wilt`, `Housing`, `HOG`, `Shopper`).
2. Trains a synthetic data generator (`CTGAN`, `TVAE`, or `GaussianCopula`), or uses the original data directly.
3. Evaluates a classifier trained on:
   - Synthetic data.
   - Watermarked synthetic data, produced by one of:
     - Floating-point modification.
     - Hash-threshold constrained sampling.
4. Reports mean and standard deviation of classification accuracy across iterations.

## Dependencies

Make sure you have the following packages installed:

```bash
pip install pandas numpy scikit-learn xgboost sdv
```

You also need the following datasets, stored as CSV files at these paths:
- `wilt/training_complete.csv`, `wilt/testing.csv`
- `housing/housing.csv`
- `hog/hog.csv`
- `shopper/shopper.csv`

## Usage

You can run the script from the command line:

```bash
python hashmark.py -f Wilt -c XGB -s CTGAN -i 5
```

### Arguments

| Argument | Description | Values | Required |
|----------|-------------|--------|----------|
| `-f` / `--dataset` | Dataset to use | `Wilt`, `Housing`, `HOG`, `Shopper` | ✅ |
| `-c` / `--classifier` | Classifier model to evaluate | `XGB`, `RF` | ✅ |
| `-s` / `--generator` | Synthetic data generator | `CTGAN`, `TVAE`, `GC`, `ORIG` | ❌ (Default: `ORIG`) |
| `-i` / `--iteration` | Number of random train/test splits to evaluate over | Integer (e.g. `10`) | ❌ (Default: `10`) |
| `-t` / `--threshold` | Threshold for constrained sampling | `1/4`, `1/3`, `1/2`, `2/3`, `3/4`, `1` | ✅ |
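
As a rough illustration, the flags in the table could be declared with `argparse` as sketched here. This assumes the values and defaults listed in the table; the actual parser in `hashmark.py` may differ, and converting the threshold with `fractions.Fraction` is an illustrative choice rather than the script's confirmed behavior.

```python
import argparse
from fractions import Fraction

# Threshold strings as listed in the arguments table.
THRESHOLDS = ["1/4", "1/3", "1/2", "2/3", "3/4", "1"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the arguments table (illustrative only)."""
    parser = argparse.ArgumentParser(description="HashMark CLI sketch")
    parser.add_argument("-f", "--dataset", required=True,
                        choices=["Wilt", "Housing", "HOG", "Shopper"])
    parser.add_argument("-c", "--classifier", required=True,
                        choices=["XGB", "RF"])
    parser.add_argument("-s", "--generator", default="ORIG",
                        choices=["CTGAN", "TVAE", "GC", "ORIG"])
    parser.add_argument("-i", "--iteration", type=int, default=10)
    parser.add_argument("-t", "--threshold", required=True,
                        choices=THRESHOLDS)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-f", "Wilt", "-c", "XGB", "-t", "1/2"])
threshold = Fraction(args.threshold)  # exact rational, avoids float rounding
print(args.dataset, args.classifier, args.generator, args.iteration, threshold)
```

Keeping the threshold as an exact fraction makes comparisons such as "3 of 4 cells" unambiguous.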

## 🔍 Watermarking Techniques

Two watermarking strategies are supported:

### 1. **Unconstrained (Floating-point Adjustment)**
Modifies floating-point values so that the SHA256 hash of each value, taken modulo 2, equals `0`. Select this mode by passing a threshold of `1` (`-t 1`).
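
One possible way to perform such an adjustment is sketched below. It is an assumption-laden illustration, not the script's confirmed method: the helper names, the string serialization of each value, the rounding to `decimals` places, and the choice to nudge the value upward in steps of `10**-decimals` are all hypothetical.

```python
import hashlib


def hash_to_bit(value) -> int:
    """Map a cell to one bit: SHA256 of its string form, modulo 2."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return digest[-1] & 1  # lowest bit of the digest == digest as integer % 2


def watermark_float(x: float, decimals: int = 6, max_tries: int = 64) -> float:
    """Nudge x in steps of 10**-decimals until its hash bit is 0."""
    step = 10 ** -decimals
    base = round(x, decimals)
    for k in range(max_tries):
        candidate = round(base + k * step, decimals)
        if hash_to_bit(candidate) == 0:
            return candidate
    # Each try succeeds with probability about 1/2, so this is very unlikely.
    raise RuntimeError("no candidate with hash bit 0 found")


marked = watermark_float(3.14159)
print(marked, hash_to_bit(marked))
```

Because each candidate hashes to `0` with probability of about one half, the expected perturbation is only a couple of steps in the last retained decimal place.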

### 2. **Constrained Sampling**
Accepts only rows in which the fraction of non-label cells whose hash bit is `0` meets the threshold given by `-t`.
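
The acceptance rule can be pictured as rejection sampling. The sketch below assumes that "meets the threshold" means "at least the threshold"; the function names, the `max_retries` parameter, and the toy sampler standing in for a trained generator are hypothetical.

```python
import hashlib
import random
from fractions import Fraction


def hash_to_bit(value) -> int:
    """SHA256 of the cell's string form, modulo 2."""
    return hashlib.sha256(str(value).encode("utf-8")).digest()[-1] & 1


def zero_fraction(row: dict, label_col: str) -> Fraction:
    """Fraction of non-label cells whose hash bit is 0."""
    bits = [hash_to_bit(v) for k, v in row.items() if k != label_col]
    return Fraction(bits.count(0), len(bits))


def constrained_sample(sample_row, n_rows, threshold,
                       label_col="label", max_retries=10_000):
    """Draw rows from sample_row() and keep those meeting the threshold."""
    accepted = []
    for _ in range(max_retries):
        if len(accepted) == n_rows:
            break
        row = sample_row()
        if zero_fraction(row, label_col) >= threshold:
            accepted.append(row)
    if len(accepted) < n_rows:
        print(f"Warning: retry limit reached ({len(accepted)}/{n_rows} rows)")
    return accepted


rng = random.Random(0)  # fixed seed for reproducibility


def toy_sampler() -> dict:
    """Stand-in for a synthetic data generator (four features, one label)."""
    return {
        "a": round(rng.gauss(0, 1), 4),
        "b": round(rng.gauss(5, 2), 4),
        "c": round(rng.random(), 4),
        "d": round(rng.uniform(-1, 1), 4),
        "label": rng.randint(0, 1),
    }


rows = constrained_sample(toy_sampler, n_rows=20, threshold=Fraction(3, 4))
print(len(rows))
```

Higher thresholds reject more candidate rows, which is why a retry limit is useful in practice.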




## Output

The script prints out accuracy statistics:
```
Mean Accuracy (Synthetic Data): 85.00%
Standard Deviation (Synthetic Data): 2.30%
Mean Accuracy (Watermarked Data): 84.75%
Standard Deviation (Watermarked Data): 2.10%
```

It also warns when constrained sampling hits the retry limit and displays hash statistics per column.
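
For reference, statistics in this format could be produced along the following lines. This is a sketch only: the function name is hypothetical, the use of the population standard deviation is an assumption, and the accuracy values are placeholders rather than measured results.

```python
import statistics


def format_report(name: str, accuracies: list[float]) -> list[str]:
    """Format mean and standard deviation of accuracies as percentages."""
    mean_pct = statistics.mean(accuracies) * 100
    std_pct = statistics.pstdev(accuracies) * 100  # population std (assumed)
    return [
        f"Mean Accuracy ({name}): {mean_pct:.2f}%",
        f"Standard Deviation ({name}): {std_pct:.2f}%",
    ]


# Placeholder accuracies for illustration only, not measured results.
for line in format_report("Synthetic Data", [0.80, 0.90]):
    print(line)
```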

## Notes

- `hash_to_bit` uses SHA256 to map each cell to a single bit (`0` or `1`). For simplicity, the hash is unseeded; a seed could also be sampled and included in the SHA256 input.
- Class label distribution is preserved during synthesis and watermarking.
- Label columns are never altered during watermarking.
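
The seeded variant mentioned in the first note might look like the sketch below. The signature, the way the seed is prepended to the cell's string form, and the use of `secrets` to sample the seed are assumptions made for illustration; the actual `hash_to_bit` in the script is unseeded.

```python
import hashlib
import secrets


def hash_to_bit(value, seed: str = "") -> int:
    """SHA256 of seed + str(value), modulo 2; an empty seed means unseeded."""
    payload = f"{seed}{value}".encode("utf-8")
    return hashlib.sha256(payload).digest()[-1] & 1


seed = secrets.token_hex(16)  # secret seed kept by the data owner
bit = hash_to_bit(0.125, seed)
print(bit)
```

With a secret seed, a party who does not know the seed cannot tell which cell values hash to `0`.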

## Example

```bash
python hashmark.py -f Shopper -c RF -s TVAE -i 3
```

This runs 3 iterations of training and evaluating a Random Forest classifier on synthetic and watermarked data generated using TVAE on the `Shopper` dataset.

## Implementation Notes

- Automatic handling of missing classes
- Warning capture for data type issues
- Progress tracking during sampling
- Seed-controlled randomness for reproducibility
- For regression, use `hashmark-reg.py`. Only one dataset is available (`King`), with three regressors to choose from: `RF`, `XGB`, and `Ridge`. The same three synthetic data generators as above are supported.

```bash
python hashmark-reg.py -f King -c RF -s TVAE -i 3
```

This runs 3 iterations of training and evaluating a Random Forest regressor on synthetic and watermarked data generated using TVAE on the `King` dataset.
