!NOTE: The full dataset used in this research is not included with this submission.
Only a sample text file is provided for reference. The authors will provide the
complete dataset upon formal request.

## Prerequisites
=================================================================================
- Python 3.x
- pip

## Pre Processing Steps
=================================================================================
All the data for each country's ICE corpus is available as text files.
The files are processed through a notebook and merged into a single CSV file called `ice-merged.csv`.

The `./pre-processing` folder contains the notebook used to create this CSV file
from all the ICE corpora.

```
pre-processing
|-- create_csv.ipynb             <----- Notebook used for pre-processing
`-- ICE-CAN
    `-- Corpus
        `-- W1A-001.TXT          <----- Sample data file from the Canada corpus
```

## Running A Sample Training Experiment
=================================================================================
Step 1: Run `pip install -r requirements.txt` to install the dependencies.

Step 2: To run the experiments on the training set, either run the shell script:

`./run_classifier_sgd.sh`

or run a specific test:

`python3 main.py --category 12 --ngram-range 1,1 --train 0.7 --analyzer word --strip-accents --clf sgd --ice-corpus --no-trim --silent --no-dump`

## Workflow followed
=================================================================================
1. Training / best-model selection on the training dataset

Shell scripts used to run each ML model:

`./run_classifier_sgd.sh` -> runs SVM (SGD training) tests with k-fold cross-validation
`./run_classifier_mnb.sh` -> runs multinomial naive Bayes training tests with k-fold cross-validation
`./run_classifier_rfc.sh` -> runs random forest training tests with k-fold cross-validation
`./run_classifier_dst.sh` -> runs decision tree training tests with k-fold cross-validation

To run a single test:

`python3 main.py --category 12 --ngram-range 1,1 --train 0.7 --analyzer word --strip-accents --clf sgd --ice-corpus --no-trim --silent --no-dump`
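A parameter sweep over several configurations can also be scripted directly. The sketch below only builds the `main.py` command lines without executing them; the flag names are taken from the examples above, while the category list and n-gram ranges are illustrative:

```python
# Build main.py command lines for a sweep over n-gram ranges.
# Flags mirror the examples in this README; defaults are illustrative.
def build_sweep_commands(categories=(12,), ngram_ranges=("1,1", "1,2", "1,3")):
    commands = []
    for cat in categories:
        for ng in ngram_ranges:
            commands.append(
                ["python3", "main.py",
                 "--category", str(cat),
                 "--ngram-range", ng,
                 "--train", "0.7",
                 "--analyzer", "word",
                 "--strip-accents",
                 "--clf", "sgd",
                 "--ice-corpus", "--no-trim", "--silent", "--no-dump"])
    return commands

# Each command can then be executed with subprocess.run(cmd, check=True).
```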

2. Run tests on the evaluation dataset

Shell script to run the tests:

`./run_classifier_tests.sh`

3. Finding the most significant features

First, the vectorizer model and the classifier model need to be saved to disk.

Shell script to dump all models:

`./run_dump_classifier.sh`

To dump a single model:

`python3 main.py --category 31 --ngram-range 1,1 --train 0.7 --analyzer word --run-type dump --strip-accents --clf mnb --ice-corpus`

The dumped vectorizer and classifier are saved into the `/models` folder.
The vectorizer and classifier file names then need to be updated in `/classifiers/common.py` before the next step.

The file names follow the pattern `<category_id>_<vec or clf name>_<ngram>_<randomid>`, where `<randomid>` is the same for both the vectorizer and the classifier.

Sample vectorizer file name: `28_vec_word_1_1_1596698270.joblib`
Sample classifier file name: `28_mnb_clf_word_1_1_1596698270.joblib`
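As a sanity check before editing `/classifiers/common.py`, the category id and `<randomid>` of a vectorizer/classifier pair can be compared programmatically. This is a small sketch based only on the naming pattern above; the helper names are not part of the repository:

```python
from pathlib import Path

def parse_model_filename(name: str) -> dict:
    """Split a dumped model file name of the form
    <category_id>_..._<randomid>.joblib into its category and random id."""
    parts = Path(name).stem.split("_")
    return {"category": parts[0], "randomid": parts[-1]}

def is_matching_pair(vec_name: str, clf_name: str) -> bool:
    """A vectorizer/classifier pair belongs together when both the
    category id and the trailing random id are identical."""
    return parse_model_filename(vec_name) == parse_model_filename(clf_name)
```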

Then run the feature-plotting shell script:

`./run_plot_classifier.sh`

## Sample Output
=================================================================================
Sample output of the system:
```
['train_precentage', 'strip_accents', 'lemmatization', 'analyzer', 'ngram_range', 'stop_words', 'max_iter', 'languages', 'trim_by_category', 'training time', 'testing time', 'accuracy', 'precision', 'recall', 'f1-score']
0.7, True, False, char, (1| 6), None, 50, ['native'| 'non_native'], True,5.25s, 0.03s,0.7895, 0.79, 0.79, 0.79
0.7, True, False, char, (1| 7), None, 50, ['native'| 'non_native'], True,1256.76s, 8.81s,-0.0000, 0.58, 0.15, 0.24
```
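The rows are comma-separated, while tuple- and list-valued fields use `|` internally so they survive the comma split. A minimal sketch for reading one result line back into a dict, using the field names from the header above (the parsing helper itself is not part of the repository):

```python
# Field names copied verbatim from the output header above.
HEADER = ['train_precentage', 'strip_accents', 'lemmatization', 'analyzer',
          'ngram_range', 'stop_words', 'max_iter', 'languages',
          'trim_by_category', 'training time', 'testing time',
          'accuracy', 'precision', 'recall', 'f1-score']

def parse_result_row(line: str) -> dict:
    """Map one comma-separated result line onto the header fields.
    Tuple/list fields keep their raw string form, e.g. "(1| 6)"."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(HEADER, values))
```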
