Notes for adversarial trading program

Notes on code cleaning:

What should the defaults be?

save_fast_data_function()
	Can we comment this, and/or give the variables more intuitive names?

Docstring in save_fast_data_multiprocessing() might be better suited for save_fast_data_function().

Go through args help strings.

20200214
Consider momentum models as a baseline. I am thinking this could mean the orderbook is mapped to a vector of entries {0, 1, 2} for down, flat, and up at each timestep (maybe every second or every five seconds), and this is the input to a NN. That could be compared to a model that just predicts whatever movement the input sequence shows.
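A minimal sketch of that idea (hypothetical helper names; one possible reading of the comparison model is a majority vote over the movements in the input sequence):

```python
import numpy as np

def movements(prices):
    """Map a price series to {0, 1, 2} for down, flat, up at each timestep."""
    diffs = np.diff(np.asarray(prices, dtype=float))
    return np.where(diffs < 0, 0, np.where(diffs > 0, 2, 1))

def momentum_baseline(seq):
    """Naive baseline: predict the majority movement seen in the input sequence."""
    counts = np.bincount(np.asarray(seq), minlength=3)
    return int(np.argmax(counts))
```

The `movements` vector would be the NN input; `momentum_baseline` is the no-learning comparison.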

20200128
Good things are happening. We have three-class classification models training with two steps in the data-processing pipeline. We take data from Lobster and feed it into lobster_to_full_dataset.py to generate one file per day of data; then the dataset class within dataset_module.py loads the data as inputs and generates labels if there are none.
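As a sketch, the generated labels could come from the movement of a reference price some horizon ahead (hypothetical helper; the actual rule in dataset_module.py may differ):

```python
import numpy as np

def make_labels(mid, horizon, eps=0.0):
    """Three-class labels from movement `horizon` steps ahead:
    0 = down, 1 = flat, 2 = up. `eps` is a dead-band for the flat class."""
    mid = np.asarray(mid, dtype=float)
    delta = mid[horizon:] - mid[:-horizon]
    return np.where(delta < -eps, 0, np.where(delta > eps, 2, 1))
```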

20191122
Working code now. Tested on JPM data from Lobster with good results. On a week's worth of last-hour data, we see out-of-sample R-squared of ~0.8 when using 30-second snapshots to predict 3 seconds into the future. Plots indicate good performance too.
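For reference, the out-of-sample R-squared here is the usual 1 - SSE/SST computed on held-out data (a sketch, not the project's exact evaluation code):

```python
import numpy as np

def out_of_sample_r2(y_true, y_pred):
    """R^2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2) on the test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Predicting the test-set mean gives R^2 of 0, so ~0.8 is well above that trivial baseline.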

20191121
The version of the code that runs now includes networks to handle size-weighted averaging and signal smoothing, plus code to process Lobster data. I'd also like to add the capability to use multiple smoothed signals as input.



Everything below this point in this file is no longer useful. It pertains to old set-ups of this project and is being saved here for archival purposes only.
_______________________________________________________________________________

7/25:

LSTM network training now on a small subset of stock data downloaded from Lobster. Some funny things are happening with the output of the network: it returns a label for each entry in the sequence, as opposed to one label for the whole sequence. I'm not sure how the cross-entropy loss is working, since this means the output of the network is a different shape from the targets array passed into the training routine.
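One common fix (a sketch, assuming batch-first RNN output of shape (batch, seq_len, n_classes)) is to keep only the final timestep's logits before computing the loss, so shapes match the per-sequence targets:

```python
import numpy as np

def last_step_logits(rnn_out):
    """Select the final timestep so each sequence yields one set of class logits.
    rnn_out: (batch, seq_len, n_classes) -> (batch, n_classes)."""
    return rnn_out[:, -1, :]
```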

It is training to ~54% validation accuracy, plateauing at the same loss for many epochs. Work is needed to report which categories are being classified correctly.

By normalizing the data -- specifically, normalizing volume and price separately -- much better results were achieved by the end of the day. Much, much more data is needed to find good hyperparameters, but it looks like things are working as they should, at least with this baby dataset.
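A sketch of the separate normalization (hypothetical column layout; price columns and volume columns are z-scored with their own statistics, since the two live on very different scales):

```python
import numpy as np

def normalize_book(book, price_cols, volume_cols):
    """Z-score price features and volume features independently.
    book: (n_samples, n_features) array; *_cols are column-index lists."""
    out = np.asarray(book, dtype=float).copy()
    for cols in (price_cols, volume_cols):
        block = out[:, cols]
        out[:, cols] = (block - block.mean()) / (block.std() + 1e-8)
    return out
```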

7/26:

Code developed to log runs. Now experimenting with various hyper parameters. Results along with parameter values are being stored in table.csv in whichever output directory is given at run time (or defaulting to checkpoint/).

After many runs, my conclusion is that this LSTM has trouble training reliably on the data we have -- this holds across many combinations of hyperparameters and different event horizons for labeling the data.
The results don't conclusively show that the code or the model is bad, just that more data will go a long way. We should also discuss the best way to organize the data, and perhaps different network setups.

7/29:

Added a vanilla RNN network class to LSTM.py; no better performance. Investigated the loss curves and concluded that the learning rate was too high, so I changed it from 10 to 0.5.

7/30:

After a conversation with Micah, I realized that the testing data was sequential raw data. This was fixed by shuffling the data before separating it into training and testing sets. Performance is much more reasonable -- by that I mean the testing data is no longer being classified as only one class. A hyperparameter search is in progress to get better results; it feels like there may not be enough data here. The learning rate schedule is being tuned using plots of the loss over epochs, with success. There is still more to do, as even the best sets of hyperparameters show test accuracy in the 40s for both VanillaRNN and LSTM. I also added a GRU model and moved the model class definitions to RNN.py for organization. I like the default values in main.py (as in, they give some of the better results I've seen so far), but the default argument values have since been messed up while tuning hyperparameters per model. More work to do on this.
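The fix, sketched (shuffle indices jointly before splitting; note that with overlapping time-series windows this can still leak information between train and test, which is worth keeping in mind):

```python
import numpy as np

def shuffled_split(X, y, test_frac=0.2, seed=0):
    """Shuffle samples jointly, then split, so the test set is not a
    raw sequential tail of the data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```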

8/7:

Returning to this project now; working on SGD and regression today. Regression seems to be working, though the results are hard to interpret: the loss curves decay quickly to very low MSE in training, while the testing loss plateaus -- not sure why yet. Added SGD and cleaned up some of the implementation. Regression trains with SGD, but with odd results. Some baby cases, like linear models, should be studied.

8/8:
 
Cleaned up some code and modified the plotting to output two plots. Everything seems like it's probably working, but training is very hard with so little data.

8/9:

Added ResNet18, LinearModel, and a temporal convolution model to the code. These are used for regression only at the moment, and the code could be cleaned up considerably with some time and effort; that should wait until we've determined the set of models we'll actually use. Still have not added moving averages.

8/13:

Code refactored. Moving averages were added to the data_process routine, but since we will need to back-prop through them, they should be built into the networks (as should normalization).

8/14:

Project waiting for more information on data representation and the models used in the field. 

8/15:

Skype conversation with Ankit. To construct a timeseries we can work with, we will take an exponential moving average of the size-weighted price at each tick and use that signal to predict the moving average at some future point (maybe one minute ahead). The first predictor to work on is a linear model.
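Sketching that construction (hypothetical helper names; the decay and horizon are hyperparameters): the cleaned signal is the size-weighted price at each tick passed through an EMA, and the target is the same signal some number of ticks later.

```python
import numpy as np

def size_weighted_price(prices, sizes):
    """Size-weighted average price across book levels at one tick."""
    prices = np.asarray(prices, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    return float((prices * sizes).sum() / sizes.sum())

def ema(x, alpha):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    x = np.asarray(x, dtype=float)
    s = np.empty_like(x)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s
```

The predictor then maps a window of `ema(signal, alpha)` to its value `horizon` ticks ahead.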

8/20:

Essentially restarted development of the models today. I wrote a class that acts as a layer whose forward() takes in order book data and outputs an exponentially smoothed, size-weighted average as a 'cleaned' signal. I also wrote a LinearPredictor class, which is a single-layer linear network. The training routine in main.py trains this network to do regression on the cleaned signal. Hyperparameters include sequence_length and horizon, which require reprocessing the data when changed; this can be done in main.py with the --make_data flag. The bash script in run.sh can be used for a grid search over those and other hyperparameters of the network. At this moment the whole program is in two files: main.py and utility.py. I think adding other signals to the inputs would be interesting. Ankit mentioned stacking EMAs with different decays; he also mentioned predicting different times into the future with the same predictor. Micah has suggested using a fixed number of ticks in the book as the measure of input sequence length, but using wall time to measure the event horizon. This can be done with the data available from Lobster, but in the other file.

8/21:

Added multi-channel inputs, where each channel has a different EMA timescale. The targets currently have the same number of channels as the inputs, so if each input channel corresponds to a different EMA, the linear predictor outputs a prediction for each of those curves. Output protocols and logging are all checked and cleaned, now producing nice, readable tables. Code cleaned up and edited.
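The multi-channel construction can be sketched as stacking EMAs of the same signal with different decays (hypothetical helper names; each alpha is a hyperparameter):

```python
import numpy as np

def ema(x, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    x = np.asarray(x, dtype=float)
    s = np.empty_like(x)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def ema_stack(signal, alphas):
    """Stack EMAs with different decays into a (channels, time) input array."""
    return np.stack([ema(signal, a) for a in alphas])
```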

10/15:

Back to this project! I ran run.sh today to make sure everything runs as I expect. The data from Lobster came with "null" at the end of some lines, so a short Python script that reads each line and writes out a clean version made quick work of getting the data back into the format I like. I do have an issue with getting multiple files from Lobster if the date range is big enough; maybe a Python script to handle that needs to be written too. Broadly speaking, the next step is to write a method to propagate a perturbation through the orderbook.
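The cleaning step amounts to something like this (a sketch with hypothetical file names; adjust the trailing-token check to the actual line format):

```python
def clean_lobster_file(src_path, dst_path):
    """Strip a trailing "null" token (and any leftover comma) from each
    line of a downloaded Lobster CSV, writing a clean copy."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if line.endswith("null"):
                line = line[: -len("null")].rstrip(", ")
            dst.write(line + "\n")
```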
