###########################################
#  DESIGN NOTES FROM PRUNING NEURAL ODEs  #
###########################################

An ablation study over different design choices in neural ODE models.

We test different configurations by applying our SparseFlows method (Algorithm 1)
to investigate which network configurations are most stable
and robust under pruning. From this, we hope to better understand
how to design neural ODE flows.
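Algorithm 1 is not reproduced in these notes. As a minimal sketch, unstructured
pruning of the RHS network can be thought of as global magnitude pruning
(function name and layout are illustrative, not the actual SparseFlows
implementation):

```python
import numpy as np

def magnitude_prune(weights, prune_ratio):
    """Zero out the smallest-magnitude entries globally across all arrays.

    `weights` is a list of numpy arrays (one per layer of the RHS network);
    `prune_ratio` is the fraction of entries to remove. A sketch of
    unstructured pruning, not the SparseFlows implementation.
    """
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(prune_ratio * flat.size)
    if k == 0:
        return weights
    # k-th smallest absolute value serves as the global pruning threshold
    threshold = np.partition(np.abs(flat), k - 1)[k - 1]
    return [np.where(np.abs(w) <= threshold, 0.0, w) for w in weights]
```

In an iterative pruning schedule, this step would alternate with retraining
the surviving weights.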

For each sweep (ablation), we highlight one key experiment and one key observation: 

###
# SWEEP OVER OPTIMIZATION PARAMETERS
###

* Setup: 
  We study the stability of different configurations for the optimizer
  and how the different configurations affect the generalization 
  performance during pruning.

* Key experiment: 
  classifications_moon/opt_sweep_unstructured

* Key observation: 
  We can identify the most stable optimizer configuration by
  sparsifying the flow, which induces additional regularization.
  The most stable optimizer configuration is the one for which
  we can achieve the most pruning.
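The selection criterion above can be sketched as a small helper. Here
`evaluate` is a hypothetical callback (not part of our codebase) that trains
and prunes a model at a given ratio and returns validation accuracy:

```python
def max_stable_prune_ratio(evaluate, ratios, tolerance=0.02):
    """Largest prune ratio whose accuracy stays within `tolerance`
    of the dense (ratio 0.0) baseline.

    Assumes accuracy degrades monotonically with the prune ratio.
    `evaluate(ratio)` is a hypothetical train-prune-validate callback.
    """
    baseline = evaluate(0.0)
    best = 0.0
    for r in sorted(ratios):
        if evaluate(r) >= baseline - tolerance:
            best = r
    return best
```

Ranking optimizer configurations by this quantity then picks the one that
tolerates the most pruning.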


###
# SWEEP OVER MODEL SIZES - DEPTH VS. WIDTH
###

* Setup: 
  We study different network configurations with (approximately) the same
  number of parameters. The networks differ in the depth vs width configuration.
  We test deep and narrow vs shallow and wide. 

* Key experiments: 
  ffjord_gaussians/model_sweep_unstructured
  ffjord_spirals/model_sweep_unstructured
  

* Key observation:
  Increasing the depth of the network while reducing its width generally
  does not improve the generalization performance of the network across
  prune ratios. Specifically, one should pick the minimal depth of the
  RHS that ensures convergence; any depth beyond that usually does not
  improve the generalization performance of the flow.
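Setting up such a sweep requires RHS networks with matched parameter budgets.
A sketch of the bookkeeping, assuming an MLP RHS of shape
dim -> width -> ... -> width -> dim with biases (the exact architecture in our
experiments may differ):

```python
def rhs_param_count(depth, width, dim):
    """Parameter count of an MLP RHS with `depth` hidden layers of `width`
    units, mapping dim -> width -> ... -> width -> dim, with biases."""
    n = dim * width + width                     # input layer
    n += (depth - 1) * (width * width + width)  # hidden-to-hidden layers
    n += width * dim + dim                      # output layer
    return n

def width_for_budget(depth, dim, budget):
    """Widest hidden layer that keeps the RHS within `budget` parameters."""
    w = 1
    while rhs_param_count(depth, w + 1, dim) <= budget:
        w += 1
    return w
```

For example, on 2-d data a depth-1/width-64 RHS has 322 parameters, and
width_for_budget(3, 2, 322) returns 11 (321 parameters), giving a deep-narrow
counterpart at roughly the same budget.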


###
# SWEEP OVER ACTIVATIONS
###

* Setup:
  We study the same network configurations for the same amount of pruning
  and vary the activation function of the neural network on the RHS. 
  As we prune, we hope to unearth which activation function is most robust
  to pruning and, consequently, to changes in the architecture. 

* Key experiment:
  ffjord_gaussians/activation_sweep_unstructured

* Key observations: 
  ReLU is usually not a very useful activation function. Rather,
  Lipschitz-continuous activation functions are most useful. Generally, we
  found tanh and sigmoid to be most useful, although sigmoid was
  probably the most robust single configuration across all experiments.
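To make concrete what varying the activation means here, a hypothetical MLP
right-hand side f(t, y) with a swappable nonlinearity (names and layout are
illustrative, not our actual model code):

```python
import numpy as np

# candidate nonlinearities for the RHS network
ACTIVATIONS = {
    "relu": lambda x: np.maximum(x, 0.0),
    "tanh": np.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
}

def rhs(params, t, y, activation="sigmoid"):
    """Hypothetical MLP right-hand side f(t, y).

    `params` is a list of (W, b) pairs; the chosen activation is applied
    after every layer except the last, so only the nonlinearity changes
    between sweep configurations.
    """
    act = ACTIVATIONS[activation]
    h = y
    for W, b in params[:-1]:
        h = act(h @ W + b)
    W, b = params[-1]
    return h @ W + b
```

The sweep then trains and prunes the same architecture once per key in
ACTIVATIONS.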


###
# SWEEP OVER ODE SOLVERS
###

* Setup:
  We study the same network configurations for the same amount of pruning
  and vary the ODE solver of the neural ODE flow. 
  As we prune, we hope to unearth which solver is most robust
  to pruning and, consequently, to changes in the architecture. 

* Key experiment: 
  ffjord_gaussians/solver_sweep_unstructured

* Key observations:
  Generally, we found adaptive step size solvers (dopri5) to be superior
  to fixed step size solvers (rk4, euler). Moreover, we found
  backpropagation through time (BPTT) to be slightly more stable than
  the adjoint method. Interestingly, we could often only observe
  the differences in robustness between the different solvers
  after we started pruning and sparsifying the flows. 
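For intuition on the fixed-step solvers, a self-contained sketch of euler vs.
rk4 integration (not our training code; dopri5 is adaptive and omitted here):

```python
import numpy as np

def integrate(f, y0, t0, t1, steps, method="rk4"):
    """Fixed-step integration of dy/dt = f(t, y) from t0 to t1.

    In a neural ODE, `f` would be the (possibly pruned) RHS network;
    here it can be any callable. euler is first order, rk4 fourth order.
    """
    h = (t1 - t0) / steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(steps):
        if method == "euler":
            y = y + h * f(t, y)
        else:  # classical Runge-Kutta 4
            k1 = f(t, y)
            k2 = f(t + h / 2, y + h / 2 * k1)
            k3 = f(t + h / 2, y + h / 2 * k2)
            k4 = f(t + h, y + h * k3)
            y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y
```

On dy/dt = -y over [0, 1], rk4 lands far closer to exp(-1) than euler at the
same step count, which is the accuracy gap the solver sweep probes under
pruning.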