# Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis

This repository contains the experimental results for our paper "Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis."

![EoM of DNNs](images/figure1.svg)

## Requirements
- Python 3.6.8
- TensorFlow 2.3.0
- Numpy 1.18.5
- CUDA 11.3
- cuDNN 7.4.1

## Files and Directories
- `./train_mnist.py`
    - Training code.
- `./load_models.ipynb`
    - Evaluation. Load checkpoint models, calculate discretization error, and output results in CSV format.
- `./plot_figures.ipynb`
    - Plot figures using the CSV files.
- `./configs/config_mnist.yaml`
    - Config file used for training.
- `./configs/info_mnist.yaml` and `./configs/info_available_list.yaml`
    - Used for training.
- `./algorithm`, `./dataprocess`, `./models`, and `./utils`
    - Functions for training.
- `./images`
    - Images for README.md.
- **`./csv`**
    - Experimental results.

## Results: `./csv`
Discretization error and weight norms are saved as CSV files. All the experimental results in our paper can be reconstructed from these CSV files.
### File names
LR0, LR1, LR2, and LR3 mean learning rate 0.1, 0.01, 0.001, and 0.0001, respectively. WD0 and WD1 mean weight decay 0.01 and 0.001, respectively. GF means the original gradient flow, while EoM means the gradient flow with the counter term ${\bm \xi} = \tilde{\bm \xi}_0$, i.e., Equation of Motion with the leading counter term.
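
The tag conventions above can be decoded programmatically. The following is a minimal sketch, not part of the repository: the helper `decode_tags` and the example file name are hypothetical, and the exact file-name layout in `./csv` is not assumed beyond the presence of the `LR*`, `WD*`, and `GF`/`EoM` tags.

```python
import re

# Mappings taken from the description of the file names above.
LEARNING_RATES = {"LR0": 0.1, "LR1": 0.01, "LR2": 0.001, "LR3": 0.0001}
WEIGHT_DECAYS = {"WD0": 0.01, "WD1": 0.001}


def decode_tags(name):
    """Return the settings encoded by the LR*/WD*/GF/EoM tags found in `name`.

    Hypothetical helper: a tag that is absent from `name` maps to None.
    """
    lr = re.search(r"LR[0-3]", name)
    wd = re.search(r"WD[01]", name)
    flow = re.search(r"GF|EoM", name)
    return {
        "learning_rate": LEARNING_RATES[lr.group()] if lr else None,
        "weight_decay": WEIGHT_DECAYS[wd.group()] if wd else None,
        "flow": flow.group() if flow else None,
    }


# Hypothetical file name, used only for illustration.
print(decode_tags("LR1_WD0_EoM.csv"))
```

A dictionary lookup keeps the mapping in one place, so the tags in a file name can be turned into numeric hyperparameters when grouping results for plotting.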
### Naming Rule
See also our paper's notation. Note that our model architecture is: Linear (`dense0`) -> Swish -> Linear (`dense1`) -> Batch Norm -> Linear (`dense2`) -> Softmax loss.
- `GD step`: Number of steps of gradient descent.
- `norm_thperp0`: $||\bm{\theta}_{\mathcal{A}\perp}||$, where $\mathcal{A}$ is the last linear layer (`dense2`). The model is trained with gradient descent (GD).
- `norm_thperp1`: $||\bm{\theta}_{\mathcal{A}\perp}||$, where $\mathcal{A}$ is the last linear layer (`dense2`). The model is trained with gradient flow (GF) or EoM.
- `norm_thperp1-0`: Difference between the above two.
- `norm_thperp1-0/0`: `norm_thperp1-0` divided by `norm_thperp0`.
- `normsq_dense10`: $||\bm{\theta}_{\mathcal{A}}||^2$, where $\mathcal{A}$ is the second linear layer (`dense1`). The model is trained with GD.
- `normsq_dense11`: $||\bm{\theta}_{\mathcal{A}}||^2$, where $\mathcal{A}$ is the second linear layer (`dense1`). The model is trained with GF or EoM.
- `normsq_dense1_1-0`: Difference between the above two.
- `normsq_dense1_1-0/0`: `normsq_dense1_1-0` divided by `normsq_dense10`.
- `sum_dense20`: Sum of the weight parameters of the last linear layer (`dense2`). The model is trained with GD.
- `sum_dense21`: Sum of the weight parameters of the last linear layer (`dense2`). The model is trained with GF or EoM.
- `sum_dense2_1-0`: Difference between the above two.
- `sum_dense2_1-0/0`: `sum_dense2_1-0` divided by `sum_dense20`.
- `errex_by_norm_all`: 
- `errex`: Discretization error of GF or EoM.
- `errth`: Simulated result of the theoretical prediction of the discretization error. It is given by $|| \mathbf{e}_{k} || = \frac{\eta^2}{2} || \sum_{s=0}^{k-1} ( H (\mathbf{\theta}(s\eta)) + \lambda I) \mathbf{g} (\mathbf{\theta}(s\eta)) ||$, where the $O(\eta^3)$ term is neglected and $\mathbf{e}_0$ is set to exactly zero throughout the experiment. The summand on the R.H.S. is approximated by the central difference $( H({\bf \theta}(t)) + \lambda I ) {\bf g} ({\bf \theta}(t)) \sim \frac{{\bf g} ({\bf \theta}(t) + \epsilon {\bf g} ({\bf \theta}(t))) - {\bf g} ({\bf \theta}(t) - \epsilon {\bf g} ({\bf \theta}(t)))}{2\epsilon}$, where $\epsilon$ is set to 1e-7.
- `errth_100stp`: Simulated result of the theoretical prediction of the discretization error. It is given by $||{\bf e}_k|| = ||\tilde{\bf e}_{100} + \frac{\eta^2}{2} \sum_{s=100}^{k-1}  ( H (\mathbf{\theta}(s\eta)) + \lambda I) \mathbf{g} (\mathbf{\theta}(s\eta))||$, where $\tilde{{\bm e}}_{100}$ is given by `errex`.
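
The central-difference approximation used for `errth` can be illustrated with a minimal NumPy sketch. It is not the repository's implementation. It assumes that $\mathbf{g}$ is the gradient of the weight-decay-regularized loss, so that differencing $\mathbf{g}$ along its own direction yields $(H + \lambda I)\mathbf{g}$. The function names `hvp_central` and `predicted_error_norm` and the toy quadratic loss are hypothetical.

```python
import numpy as np


def hvp_central(grad_fn, theta, eps=1e-7):
    """Approximate (H + lambda I) g at `theta` by a central difference.

    Assumes `grad_fn` returns the gradient of the regularized loss, so its
    Jacobian is H + lambda I.
    """
    g = grad_fn(theta)
    return (grad_fn(theta + eps * g) - grad_fn(theta - eps * g)) / (2.0 * eps)


def predicted_error_norm(grad_fn, thetas, eta, eps=1e-7):
    """(eta^2 / 2) * || sum_s (H + lambda I) g || over thetas[s] = theta(s * eta)."""
    total = np.zeros_like(thetas[0])
    for theta in thetas:
        total += hvp_central(grad_fn, theta, eps)
    return 0.5 * eta ** 2 * np.linalg.norm(total)


# Toy quadratic loss (an assumption for illustration only):
# L(theta) = 0.5 * theta^T A theta + 0.5 * lam * ||theta||^2
A = np.diag([1.0, 2.0])
lam = 0.01


def grad_fn(theta):
    return A @ theta + lam * theta


# For this loss the gradient flow has the closed form
# theta(t) = exp(-(A + lam I) t) theta(0), since A is diagonal.
eta, k = 0.1, 5
theta0 = np.array([1.0, -1.0])
rates = np.diag(A) + lam
thetas = [np.exp(-rates * s * eta) * theta0 for s in range(k)]

print(predicted_error_norm(grad_fn, thetas, eta))
```

The central difference needs only two extra gradient evaluations per step and never forms the Hessian explicitly, which matters when the parameter count is large.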

