Non-reversible Parallel Tempering for Uncertainty Approximation

Preliminaries

Set up the random seed
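A minimal sketch of this step; the seed value itself is an assumption, not taken from the notebook:

```python
import numpy as np

np.random.seed(2022)  # fix the global NumPy seed so every run is reproducible
```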

Build a non-convex energy function
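The exact energy used in the notebook is not reproduced here; the sketch below assumes a mixture of isotropic Gaussians, a standard choice for multi-modal demos. The mode locations and bandwidth are placeholders for illustration:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical mode locations and bandwidth -- placeholders, not the
# notebook's actual values
MODES = np.array([[-4.0, -4.0], [-4.0, 4.0], [4.0, -4.0], [4.0, 4.0], [0.0, 0.0]])
SIGMA = 0.7

def energy(x):
    """Non-convex energy U(x) = -log sum_k exp(-||x - mu_k||^2 / (2 sigma^2))."""
    sq_dists = np.sum((x - MODES) ** 2, axis=-1)
    return -logsumexp(-sq_dists / (2 * SIGMA ** 2))

def grad_energy(x, eps=1e-5):
    """Central finite-difference gradient of U (analytic form omitted)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (energy(x + e) - energy(x - e)) / (2 * eps)
    return g
```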

Partition the X-Y plane into a grid (sketched together with the next step below)

Show the ground truth distribution
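A minimal sketch covering this step and the grid partition above, reusing the `energy` helper from the previous sketch; the plotting range and grid resolution are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Partition the X-Y plane into a regular grid (range and resolution assumed)
axis = np.linspace(-8.0, 8.0, 200)
X, Y = np.meshgrid(axis, axis)
grid = np.stack([X, Y], axis=-1)                   # shape (200, 200, 2)

# Ground-truth density: exp(-U), normalized by a Riemann sum over the grid
U = np.apply_along_axis(energy, -1, grid)
density = np.exp(-U)
density /= density.sum() * (axis[1] - axis[0]) ** 2

plt.contourf(X, Y, density, levels=30)
plt.title("Ground truth")
plt.show()
```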

Build the samplers: the baselines (ensemble SGLD, cycSGLD, and the naive DEO scheme) and the generalized DEO$_{\star}$ scheme

Total number of running iterations

Total number of chains

Standard deviation of gradient / energy estimator
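For reference, the three inputs above collected in one place; the numeric values are placeholders rather than the notebook's actual settings:

```python
n_iters   = 50_000  # total number of running iterations (assumed value)
n_chains  = 10      # total number of chains (assumed value)
noise_std = 1.0     # std of the stochastic gradient / energy estimator (assumed)
```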

[1/4] Baselines: ensemble SGLD (16 parallel runs)
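A minimal sketch of one SGLD chain and its 16-run ensemble, assuming the `energy` / `grad_energy` helpers above; the step size, burn-in, and initialization are illustrative guesses:

```python
import numpy as np

def sgld_step(x, lr, temperature=1.0, grad_std=noise_std):
    """One SGLD update: x <- x - lr * g + sqrt(2 * lr * T) * N(0, I), where g
    is a noisy gradient (Gaussian noise mimics a stochastic estimator with
    the std set above)."""
    g = grad_energy(x) + grad_std * np.random.randn(*x.shape)
    return x - lr * g + np.sqrt(2.0 * lr * temperature) * np.random.randn(*x.shape)

def ensemble_sgld(n_runs=16, n_iters=5_000, lr=0.03, burn_in=1_000):
    """16 independent SGLD runs pooled into a single set of samples."""
    samples = []
    for _ in range(n_runs):
        x = np.random.uniform(-6.0, 6.0, size=2)   # fresh start for each run
        for t in range(n_iters):
            x = sgld_step(x, lr)
            if t >= burn_in:
                samples.append(x.copy())
    return np.array(samples)
```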

Comment: Ensemble SGLD identifies most of the modes, but fails to quantify the uncertainty well.

[2/4] Baselines: cyclic SGLD (16 times the running time of a single run)
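The cosine schedule below follows the cSGLD recipe of Zhang et al. (2020), which we assume is what "cyclic SGLD" refers to here; the cycle count and peak step size are illustrative, and `sgld_step` comes from the sketch above:

```python
import numpy as np

def cyclical_lr(t, n_iters, n_cycles=16, lr_max=0.05):
    """Cosine cyclical step size: restarts at lr_max at the beginning of each
    cycle and decays toward 0, so large steps explore new modes and small
    steps refine the local samples."""
    period = n_iters // n_cycles
    frac = (t % period) / period
    return 0.5 * lr_max * (np.cos(np.pi * frac) + 1.0)

def cyclic_sgld(n_iters=16 * 5_000, burn_in=1_000):
    """A single long run whose budget matches the 16 ensemble runs above."""
    samples, x = [], np.random.uniform(-6.0, 6.0, size=2)
    for t in range(n_iters):
        x = sgld_step(x, cyclical_lr(t, n_iters))
        if t >= burn_in:
            samples.append(x.copy())
    return np.array(samples)
```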

Comment: Cyclic SGLD seems to explore well, but the uncertainty estimate is still not accurate enough.

[3/4] Baselines: Run PT via the naive deterministic even-odd (DEO) scheme (DEO-SGD)

The default DEO scheme is equivalent to setting a window size of 1.
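A sketch of one DEO swap round under the standard parallel-tempering acceptance rule, with inverse temperatures `betas` standing in for the learning-rate ladder used in this notebook; this is our reading of the scheme, not the notebook's exact code:

```python
import numpy as np

def deo_swap(xs, betas, t):
    """Naive DEO: at even rounds attempt the even adjacent pairs (0,1),
    (2,3), ...; at odd rounds the odd pairs (1,2), (3,4), ....  A pair
    (i, i+1) swaps with probability
    min(1, exp((betas[i] - betas[i+1]) * (U(x_i) - U(x_{i+1}))))."""
    accepted = []
    for i in range(t % 2, len(xs) - 1, 2):
        log_acc = (betas[i] - betas[i + 1]) * (energy(xs[i]) - energy(xs[i + 1]))
        if np.log(np.random.rand()) < min(0.0, log_acc):
            xs[i], xs[i + 1] = xs[i + 1].copy(), xs[i].copy()
            accepted.append(i)
    return xs, accepted
```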

Set up the hyperparameters

Fix the lowest & highest learning rates and the target swap rate

Other inputs

Geometrically initialize the learning rates for the remaining chains
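The endpoint values are assumptions; `np.geomspace` gives the geometric interpolation between them:

```python
import numpy as np

lr_min, lr_max = 1e-3, 1e-1                   # lowest & highest learning rates (assumed)
lrs = np.geomspace(lr_min, lr_max, n_chains)  # geometric ladder across the chains
```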

Initialize the temperatures: 1 for the SGLD (sampling) kernel and 0 for the exploration kernels driven by SGD

Set the step size of the stochastic approximation used to adapt the learning rates
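A sketch covering this step and the temperature setup above; the step size, target rate, and the exact adaptation rule are assumptions in the spirit of the paper's stochastic approximation, not the notebook's code:

```python
import numpy as np

# Temperatures: 1 for the SGLD (sampling) chain, 0 for the SGD exploration
# kernels, as stated above
temperatures = np.zeros(n_chains)
temperatures[0] = 1.0

sa_step = 0.01     # stochastic-approximation step size (assumed value)
target_rate = 0.4  # target swap rate fixed earlier (assumed value)

def adapt_ladder(log_lrs, swap_rates, step=sa_step, target=target_rate):
    """Robbins-Monro style adaptation (a sketch): widen the gap between
    adjacent learning rates when their empirical swap rate exceeds the
    target, shrink it otherwise, so all pairs drift toward the target."""
    gaps = np.diff(log_lrs) * np.exp(step * (swap_rates - target))
    return np.concatenate(([log_lrs[0]], log_lrs[0] + np.cumsum(gaps)))
```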

Draw the density estimate for the naive DEO scheme

Comment: from the rightmost figure, we see that parallel tempering achieves only mediocre performance due to the high round-trip cost.

Study the index process

Comment: The above index process takes a long time to finish a single round trip with the naive DEO scheme.

[4/4] Run PT via the generalized DEO with the optimal window size (DEO$_{\star}$-SGD)

According to our theory, the optimal window size is of order $\mathcal{O}(\log P)$ in the number of chains $P$, which reduces the round-trip (communication) cost to $\mathcal{O}(P \log P)$.
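A sketch of the generalized scheme as we read it: the swap parity flips every `window` rounds, and a pair that has already swapped stays idle until the window ends, so `window = 1` recovers the naive DEO. The concrete `window` formula below only mirrors the $\mathcal{O}(\log P)$ scaling and is an assumption, not the notebook's exact rule:

```python
import numpy as np

window = max(1, int(np.log(n_chains)))  # O(log P) heuristic, not the exact rule

def deo_star_swap(xs, betas, t, swapped):
    """Generalized DEO round: even pairs during even windows, odd pairs during
    odd windows; swapped[i] marks that pair (i, i+1) already swapped in the
    current window and stays idle, which makes the index process more
    deterministic at high rejection rates."""
    if t % window == 0:
        swapped[:] = False                      # a new window begins
    for i in range((t // window) % 2, len(xs) - 1, 2):
        if swapped[i]:
            continue
        log_acc = (betas[i] - betas[i + 1]) * (energy(xs[i]) - energy(xs[i + 1]))
        if np.log(np.random.rand()) < min(0.0, log_acc):
            xs[i], xs[i + 1] = xs[i + 1].copy(), xs[i].copy()
            swapped[i] = True
    return xs
```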

Run the same algorithm again, now with the optimal window size

Draw the density estimate for the DEO$_{\star}$ scheme

Comment: given the same training cost, non-reversible PT with the optimal window size approximates the ground-truth density best.

Study the index process

Comment: compared with the previous figure, we see fewer swaps within each window, but the round trips become more deterministic and regular.

Comparison of round trips
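A round trip here means a replica travels from the coldest chain up to the hottest and back. A sketch that counts round trips from a saved index process (the array layout is an assumption):

```python
import numpy as np

def count_round_trips(index_path):
    """Count round trips of a single replica from its index process: one
    round trip is completed each time the replica travels from chain 0
    (the sampling chain) up to the hottest exploration kernel and back
    down to chain 0.  index_path holds the replica's chain index over time."""
    trips, heading_up = 0, True
    top = index_path.max()
    for idx in index_path:
        if heading_up and idx == top:
            heading_up = False       # reached the hottest chain
        elif not heading_up and idx == 0:
            heading_up = True        # back at the coldest: one full trip
            trips += 1
    return trips
```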

Convergence of acceptance rates
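To visualize convergence, the running average of the 0/1 swap indicators for each adjacent pair can be plotted against the target rate; under the stochastic-approximation updates sketched earlier, these curves should settle near the target. A hypothetical helper:

```python
import numpy as np

def running_rate(accept_history):
    """Running mean of 0/1 swap-acceptance indicators over the rounds;
    plotting it per adjacent pair shows whether the adapted learning-rate
    ladder drives the empirical rates toward the target swap rate."""
    a = np.asarray(accept_history, dtype=float)
    return np.cumsum(a) / np.arange(1, len(a) + 1)
```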