# A Probabilistic Approach to Constrained Deep Clustering

## Motivation

Clustering with constraints has gained significant attention in the field of semi-supervised machine learning as it can leverage partial prior information on a growing amount of unlabelled data. 
Following recent advances in deep generative models, we derive a novel probabilistic approach to constrained clustering that can be trained efficiently in the framework of stochastic gradient variational Bayes. 
In contrast to existing approaches, our model (CVaDE) uncovers the underlying distribution of the data conditioned on prior clustering preferences, expressed as *pairwise constraints*. The inclusion of such constraints allows the user to drive the clustering process towards a desirable configuration by indicating which samples should or should not belong to the same class.

## Data Download

To download Reuters data, run the following:

`cd dataset/reuters`

`sh download_data.sh`

To download and configure the UTKFace dataset:
- Download the cropped and aligned dataset archive from https://susanqq.github.io/UTKFace/
- Extract the images from this archive to `<code root>/dataset/utkface`

## Implementation

To run Constrained VADE with the default settings on the MNIST dataset:

`python main.py --pretrain True`

To run VADE without pairwise constraints using the default setting:

`python main.py --pretrain True --num_constrains 0`

To choose different configurations of the hyper-parameters:

`python main.py --num_constrains ... --alpha ... --lr ...`

Important hyper-parameters:
- data: dataset to use; choose from MNIST, fMNIST, Reuters, har, utkface
- num_constrains: number of pairwise constraints; by default it should be set to N, the number of samples in the data (note that the total number of possible pairwise constraints in a dataset is on the order of N*N)
- alpha: measures the confidence in your labels (the default is 10000, which means one is very sure the labels are correct)
- batch_size: should be fairly large (default 1024); otherwise it is unlikely that both samples of a constrained pair fall inside the same batch
- lr: learning rate
- decay: exponential decay of the learning rate
- num_epochs: number of training epochs
- pretrain: set to False if you want to use your own pretrained weights
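
The advice on `batch_size` can be checked with a rough back-of-the-envelope calculation. The sketch below is not part of this repository: the function name is hypothetical, and it assumes that each constraint links two distinct samples and that a batch is a uniformly random subset of the data. The value N = 60000 is only an illustrative dataset size.

```python
def expected_constraints_per_batch(n_samples, num_constraints, batch_size):
    """Expected number of constrained pairs with both samples in one batch.

    Assumption: the batch is a uniformly random subset of size `batch_size`,
    so a fixed pair (i, j) with i != j lands in it with probability
    B * (B - 1) / (N * (N - 1)). Linearity of expectation gives the total.
    """
    n, b = n_samples, batch_size
    pair_in_batch = b * (b - 1) / (n * (n - 1))
    return num_constraints * pair_in_batch


# Illustrative values only: N = 60000 samples and num_constrains = N.
for batch_size in (128, 1024):
    value = expected_constraints_per_batch(60000, 60000, batch_size)
    print(f"batch_size={batch_size}: ~{value:.2f} constrained pairs per batch")
```

Under these assumptions, a batch of 128 samples contains fewer than one constrained pair on average, whereas a batch of 1024 contains more than ten, which is consistent with the recommendation to keep `batch_size` large.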

### Pairwise constraints 

In the current implementation, the pairwise constraints are obtained from the labels by randomly sampling two data points and assigning a must-link constraint (+1) if the two samples have the same label and a cannot-link constraint (-1) otherwise. The pairwise constraints are stored in a matrix W.
See the file:
`source/data.py`
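
For illustration, the sampling procedure described above could look roughly like the following NumPy sketch. This is not the code in `source/data.py`: the function name `build_constraint_matrix`, the dense N x N storage, the symmetric fill of W, and the skipping of self-pairs are all assumptions made for this example.

```python
import numpy as np


def build_constraint_matrix(labels, num_constraints, seed=0):
    """Sample random pairs and encode must-link (+1) / cannot-link (-1) in W.

    Hypothetical helper; a dense matrix is only practical for small N.
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    rng = np.random.default_rng(seed)

    # Draw both endpoints of each candidate pair uniformly at random.
    i = rng.integers(0, n, size=num_constraints)
    j = rng.integers(0, n, size=num_constraints)

    # Assumption: pairs of a sample with itself carry no information, drop them.
    keep = i != j
    i, j = i[keep], j[keep]

    # +1 if the two samples share a label, -1 otherwise.
    sign = np.where(labels[i] == labels[j], 1, -1).astype(np.int8)

    W = np.zeros((n, n), dtype=np.int8)  # 0 means "no constraint"
    W[i, j] = sign
    W[j, i] = sign  # assumption: constraints are symmetric
    return W


# Toy usage with five samples and three classes.
toy_labels = np.array([0, 0, 1, 1, 2])
W = build_constraint_matrix(toy_labels, num_constraints=5, seed=0)
print(W)
```

Entries of W that stay at 0 correspond to pairs for which no constraint was sampled.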
