# Code walkthrough
 

## Real-data Experiments 

### Load and preprocess data
To download the AG News data ([training set](https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv) & [testing set](https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv)) and preprocess it as described in our paper, run the script `preprocess_agdata.py`. This saves the data files in the `data` folder.
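
For orientation, here is a minimal sketch of reading the raw files with the standard library. It assumes each row is a header-less triple of class index, title, and description with 1-based labels. The names `parse_ag_news` and `download_ag_news` are hypothetical, and this is not the preprocessing performed by `preprocess_agdata.py`.

```python
import csv
import io
import urllib.request

TRAIN_URL = (
    "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/"
    "master/data/ag_news_csv/train.csv"
)


def parse_ag_news(lines):
    """Parse rows assumed to be (class index, title, description), no header."""
    labels, texts = [], []
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip rows that do not match the assumed layout
        label, title, description = row
        labels.append(int(label) - 1)  # assumed 1-based labels, shifted to 0-based
        texts.append(f"{title} {description}")
    return labels, texts


def download_ag_news(url=TRAIN_URL):
    """Fetch one of the CSV files and parse it (requires network access)."""
    with urllib.request.urlopen(url) as response:
        text_stream = io.TextIOWrapper(response, encoding="utf-8", newline="")
        return parse_ag_news(text_stream)
```

`parse_ag_news` accepts any iterable of lines, so it can also be applied to a local copy opened with `open(path, newline="")`.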


### Run Semi-supervised experiment
The script used to run our experiments is `run_experiments.py`. To specify the models and parameters, call the functions below from the main function of the script.

We provide two main interfaces:

```python
runExperiment_model(run_type="one",skip_unsup=False,unsup_id=None,model_id=None,n_layers=None,representation_dim=None,embed_dim=5000,h_dim=None,nepochs=150,opt_type='amsgrad',w_decay=0,resample=2,dropout_p=0,lr=0.0002,prev_model_file=None,n_samples=4000)

runExperiment_baseline(run_type="one",unsup_id=None,embedding_type='BOW',n_samples=4000)
```

The `runExperiment_model` function trains a neural network model for unsupervised representation learning and then runs supervised learning on the training set. The function itself saves the model and the test accuracy results in the `result` and `models` folders. Similarly, `runExperiment_baseline` runs the same procedure for baseline representations.
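
As an illustration of how these two interfaces might be called, the sketch below collects keyword arguments for three runs and passes them to the functions. The run identifiers, the `prev_model_file` path, and the helper `run_all` are hypothetical placeholders. `n_layers`, `representation_dim`, and `h_dim` are left at their `None` defaults here, so set them as your model requires.

```python
# Keyword arguments for three illustrative runs. Identifiers and the model
# file path are hypothetical placeholders, not values from the paper.
fresh_run = dict(
    run_type="one",       # single supervised run on n_samples random samples
    unsup_id="demo_run",  # hypothetical run identifier
    model_id="base",      # residual model option
    n_samples=4000,
)
reuse_run = dict(
    run_type="df",        # repeated runs, saved as a dataframe
    skip_unsup=True,      # skip unsupervised training, load a saved model
    prev_model_file="models/demo_run_model",  # hypothetical path
    unsup_id="demo_run",
    model_id="base",
)
baseline_run = dict(run_type="one", unsup_id="demo_bow", embedding_type="BOW")


def run_all(run_model, run_baseline):
    """Invoke the two experiment functions passed in by the caller."""
    run_model(**fresh_run)
    run_model(**reuse_run)
    run_baseline(**baseline_run)
```

Inside the main function of `run_experiments.py`, this would be invoked as `run_all(runExperiment_model, runExperiment_baseline)`.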

Below we document the parameters that are neither model hyperparameters nor self-explanatory:

For `runExperiment_model`:
* **run_type**: if the value is `'one'`, the supervised experiment is run once on a random selection of `n_samples` training samples. If the value is `'df'`, the supervised experiment is run for multiple data points with repetitions to generate confidence intervals, and the saved dataframe can be used to generate the figure shown in our paper.
* **skip_unsup**: if the value is `True`, the unsupervised part is skipped and a previous model is loaded from the file path given in `prev_model_file`.
* **unsup_id**: identifier of a single run, used to index the results and folder names.
* **model_id**: specifier of a particular model. We provide the following options:
    * `'base'`: residual model with the end of the last residual layer as the representation (what we actually reported in the paper)
    * `'contrastive'`: contrastive model by Tosh et al. [[1]](#1)
    * `'word2vec_train'`: residual model with the last layer multiplied by the word2vec matrix as the representation. To use this option, first generate and save your embedding matrix in the file `data/word2vec_matrix_trained.npy`.
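
With `run_type='df'`, each training-set size is evaluated several times so that a confidence interval can be drawn. The sketch below shows one common way to aggregate such repeated runs: a normal-approximation 95% interval around the mean accuracy. The record layout (`n_samples`, `accuracy`) and the function name `summarize_runs` are assumptions for illustration. The saved dataframe may use different columns, and the figure in the paper may use a different interval.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_runs(records, z=1.96):
    """Group accuracies by n_samples; return (mean, lower, upper) per size."""
    by_size = defaultdict(list)
    for rec in records:
        by_size[rec["n_samples"]].append(rec["accuracy"])

    summary = {}
    for size, accs in sorted(by_size.items()):
        m = mean(accs)
        # The standard error needs at least two repetitions; with a single
        # run the interval collapses to the point estimate.
        half = z * stdev(accs) / len(accs) ** 0.5 if len(accs) > 1 else 0.0
        summary[size] = (m, m - half, m + half)
    return summary
```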


For `runExperiment_baseline`:
* **run_type**: Same as in `runExperiment_model`.
* **unsup_id**: identifier of a single run, used to index the results and folder names.
* **embedding_type**: specifier of a particular baseline representation type. We provide the following options:
    * `'BOW'`: bag of words
    * `'word2vec'`: word2vec trained on the unsupervised dataset
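
As a reminder of what the `'BOW'` option refers to, a bag-of-words representation maps each document to a vector of word counts over a fixed vocabulary. The following is a minimal sketch with lowercasing and whitespace tokenization. The helper names and the tokenization are assumptions, not the code used by `runExperiment_baseline`.

```python
from collections import Counter


def build_vocab(texts, max_size=None):
    """Map tokens to column indices, most frequent first."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}


def bow_vector(text, vocab):
    """Count occurrences of each vocabulary token; unknown tokens are ignored."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec
```

For example, `bow_vector("markets markets rise", vocab)` puts a count of 2 in the column assigned to `markets`, provided that token is in `vocab`.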


### Train word2vec embeddings

To train word2vec embeddings on the unsupervised dataset, run the script `train_word2vec.py` (tune the training parameters in the main function of the script). The resulting embedding matrix is saved in the `data` folder and can later be used to initialize some of our model options.


## Acknowledgement
We adapted some of our code from Tosh et al. [[1]](#1).

## References
<a id="1">[1]</a> 
Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive estimation reveals topic posterior information to linear models. arXiv preprint arXiv:2003.02234, 2020.