\documentclass{article}


% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}
\usepackage{url}

\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      %
\usepackage{xcolor}         % colors
\usepackage[linesnumbered,ruled,vlined]{algorithm2e}
\SetKwInput{KwInput}{Input}                % Set the Input
\SetKwInput{KwOutput}{Output} 
\usepackage{tabularx}
\usepackage{graphicx}
\usepackage{subfigure}
%\usepackage{subfigure}
%\usepackage[notref,notcite]{showkeys}
\usepackage{mathrsfs}
\usepackage{amsmath,amssymb,amsthm,amsbsy,latexsym,dsfont,tikz}
\usepackage{wrapfig,lipsum}

%\usepackage{graphicx,color}
%\usepackage[numeric,initials,nobysame]{amsrefs}
%\usepackage{upref,setspace}
%\usepackage{bbm,soul} %\st,\ul from soul package
%\usepackage{epstopdf}
%\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}

\usepackage{enumerate}
\newenvironment{enumeratei}{\begin{enumerate}[\upshape (i)]}{\end{enumerate}}
\newenvironment{enumeratea}{\begin{enumerate}[\upshape (a)]}{\end{enumerate}}
\newenvironment{enumeraten}{\begin{enumerate}[\upshape 1.]}{\end{enumerate}}
\newenvironment{enumerateA}{\begin{enumerate}[\upshape (A)]}{\end{enumerate}}

\usepackage{enumitem}

\newcommand{\XC}[1]{{{\textcolor{red}{#1}}}}

\title{Rebuttal}

\begin{document}

\maketitle

\section*{Reply to Reviewer 1NDe}

{\it I am not sure about whether this net can be generalized to unseen distribution Wasserstein distance calculations especially the ones are very different from the training data. Since in many times we don't know whether the distributions in the application field is similar to the ones in training, and if we just calculate the OT question, we can get the accurate results although it will be slow. But I think it is hard to ensure such a learned neural operator is generally applicable. If need retrain for OOD cases, then I want to know how much data is needed.

Question: How much data is needed for retraining for OOD cases?}

{\bf Reply.} Thank you for raising the OOD retraining issue. To clarify, our OOD experiments on 1D and 2D Gaussian mixtures were tested on data not seen in training but still belonging to the same family of Gaussian mixture distributions, with a set of variance parameters different from those used in training. Recall that the training dataset contains 20,000 pairs for 1D Gaussian mixtures and 5,000 pairs for 2D Gaussian mixtures. In the current Gaussian mixture setting, if we retrain GeONet on new data whose mixture components have different variance parameters, we empirically observe that around 500 new pairs are needed to achieve similar $L^1$ errors.

\medskip

%that has not been seen in training but is somewhat related to the data in which GeONet was trained, the neural network parameters would likely be considerably initialized towards the new GeONet input. 

%If some new data is supplemented into the dataset in which GeONet was previously trained, we expect GeONet would require less data, perhaps around 500 pairings, depending on how well we would hope to predict a pairing from this new family.  GeONet would generally require 500+ or perhaps thousands of new data pairings to encapsulate an entire retraining of new data, completely unlike data of prior instances of training, which is not atypical.

{\it I don't quite understand the justification of "no need for retraining for new input", I think it is for applicable for all the usages.}

{\bf Reply.} We apologize for the confusion. This sentence was meant as a comparison with traditional optimization-based OT solvers (such as iterative Bregman projections [1]) and with PINNs, which learn the solution of a {\bf given} (i.e., {\bf single}) PDE rather than the solution operator mapping boundary conditions to PDE solutions. GeONet is an operator learning method, so it does not need to be retrained on the OT dynamics for new boundary conditions. To address your concern, we changed ``no need for retraining for new input'' to ``operator learning'' in Table 1 of the revision.

%Like every other learning based methods after training, GeONet computes geodesic in simple forward passes.

\medskip

%does not need to be retrained for a new pairing of input. The geodesic is computed with simple forward passes of the neural networks with a given instance of the neural network parameters. Generally, the test data would belong to the same family of distributions as those seen in training for the greatest test accuracy.

[1] Benamou, Carlier, Cuturi, Nenna, Peyr\'e. (2015) Iterative Bregman Projections for Regularized Transportation Problems. {\it SIAM Journal on Scientific Computing.}

\medskip

{\it Also, very important, the readability is too bad, I think myself is familiar with many maths inside, however the symbols are not defined precisely make it very difficult to understand. I think much more details of the proposed algorithms and backgrounds need to be added. It shall be an important and interesting paper, but if it can not be understood by others then it will be difficult for application fields to use it. Better have more figures about the details as well.

Question: I think this paper is rushed, better improve the writing.}

{\bf Reply.} Thank you for your suggestion. We revised the paper to address the questions raised by all reviewers. We hope that this rebuttal provides further clarification and that the revised version of the paper is satisfactory.

\section*{Reply to Reviewer NBJx}

{\it Lack of comparisons and related work: [1,2] both amortize Wasserstein geodesic learning. These should be at least cited and discussed.}

{\bf Reply.} Thank you for pointing out these references. In the revised version, we added both references and briefly discuss them.

\medskip

{\it There are many methods to compute Wasserstein geodesics relatively quickly, although without amortization. I would be curious how the quality of interpolation compares to these more recent methods, and would amend this statement for more recent work [e.g. 3,4,5].

Recently, a machine learning method to compute the Wasserstein geodesic for a given input pair of probability measures has been considered in (Liu et al., 2021).}

{\bf Reply.} In the revision, we added a comparison with two of the suggested learning-based methods: rectified flow (RF) [2] and conditional flow matching (CFM) [3]. We ran a similar 2D Gaussian mixture simulation in the discrete setting, constructing empirical distributions from point clouds sampled from fixed densities. We use POT as the ground truth for comparing GeONet, RF and CFM. The $L^1$ estimation errors of the geodesic at times $t = 0, 0.25, 0.5, 0.75, 1$ are reported in Table 3, and an example of an estimated geodesic is shown in Figure 4. We draw two observations from this new experiment. First, RF and CFM have estimation errors 3--4 times larger than those of GeONet, except at the initial time $t = 0$, and only because the initial data is given and learned directly by RF and CFM. GeONet is the only method in the comparison that captures the geodesic behavior to a considerable degree. While the other methods are suitable for point cloud representations, their interpolations lose the geodesic behavior and are highly inaccurate relative to the ground truth. Second, RF and CFM are tied to the fixed resolution of the input probability distribution pair, while GeONet can be smoothed out to estimate the density flow at a higher resolution than the input pair (cf. the third row in Figure 4). %Furthermore, this data was collected from these alternative optimal transport frameworks, and so our choice of data exists among other literature as well, making GeONet applicable to data found in experiments elsewhere.

Based on this new experiment, we also revised our statement about more recent work and added the suggested references.


\medskip

{\it The experiments are all extremely toy with limited examples and dimensionality, with the largest experiment being on a small subset (5000) of MNIST on a 30 dimensional embedded space.}

{\it Why is error calculated in the encoded space for MNIST? It would be much more meaningful to calculate error in the ambient space. The error in the encoded space is difficult to understand and an unreproducible metric, particularly given the lack of code.}

{\bf Reply.} Regarding the MNIST experiment, in the revision we retrained GeONet on the full MNIST data, using 30,000 training images for $\mu_0$ and 30,000 for $\mu_1$ and pairing them at random during training. The new testing error in the $L^1$ metric is reported in Table 4, in both the encoded space and the ambient pixel space. Compared with the encoded-space error in the initial submission (Table 3 therein), increasing the amount of MNIST data indeed decreases the testing error. Furthermore, the encoded data was normalized before computing the geodesics with GeONet, which likely also contributed to the decrease in error. Moreover, as expected, the ambient-space error is much larger than the encoded-space error, meaning that the geodesics in the encoded space and in the ambient image space do not coincide. We noted in the {\it Limitations} subsection 4.5 (initial submission) that an encoding strategy seems necessary for such a dataset. In addition, we included the source code for the full MNIST experiments in the supplementary material of our revised submission.

%We found the encoding had drastic performance gains over attempting to learn the geodesic in the ambient space directly. The geodesic in the encoded space does not have an exact translation to the geodesic in the ambient space, which we have since highlighted in the paper in the limitations section for clarity. We originally included this detail in the appendix. The lack of exact translation could possibly be remedied by additionally encoding ground truth geodesics, and finding encoded solutions that encompass the initial conditions as well as the intermediate times, but this defeats the purpose of GeONet in the sense that the geodesic can be solved with zero input data aside from the endpoint distributions. We use the same error metric as our previous experiments, as the L1 metric, computed at predetermined times t=0,0.25,0.5,0.75,1. With this experiment, the geodesic in the encoded space can be used for reasons aside from needing the geodesic in the ambient space, and so we hope the experiment is not nullified in value for this reason. In the rebuttal, we included our code for the MNIST experiment in the supplementary material.


\medskip

{\it Why is the ground truth for the Gaussian mixture approximated using Convolutional Wasserstein Barycenters? I’m actually fairly surprised the error is so low, especially for low regularization values.}

{\bf Reply.} Convolutional Wasserstein Barycenters (CWB) is most suitable for computing OT maps and dynamics on meshes over large domains in the continuous (non-point-cloud) setting, and favorable timing and numerical behavior relative to other algorithms (linear programs, Sinkhorn, iterative projections) have been reported in [4]. Given the discretization size in our problems, we used the off-the-shelf CWB solver in the standard POT Python library to compute the Wasserstein geodesic (the barycenter being the special case $t = 0.5$). Numerical issues may arise when the regularization parameter is smaller than the resolution of the domain discretization, because the convolution kernel becomes ill-conditioned; similar issues would also occur for other entropy-regularized algorithms with low regularization values. In our experiments, the ``ground truth'' geodesics computed by CWB appear entirely reasonable (cf. Figures 3, 4, 7 in the revised version).

We presume you are referring to the errors between GeONet and CWB. Since CWB is the solver best suited to computing 2D geodesics in the continuous setting, it is a reasonable reference for GeONet, and the reported errors are measured against it.
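
As a sketch of why a barycenter solver yields the geodesic (the symbols $\nu$ and $W_2$ are introduced here for illustration, and the identity is stated for the unregularized problem, whereas CWB solves an entropy-regularized approximation), the geodesic point $\mu_t$ for $t \in [0,1]$ is the two-measure barycenter of $\mu_0$ and $\mu_1$ with weights $(1-t, t)$:
\[
\mu_t \in \operatorname*{arg\,min}_{\nu} \Big\{ (1-t)\, W_2^2(\mu_0, \nu) + t\, W_2^2(\nu, \mu_1) \Big\}.
\]
Setting $t = 0.5$ recovers the equal-weight barycenter, and sweeping $t$ over $[0,1]$ traces the geodesic from $\mu_0$ to $\mu_1$.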

\medskip

{\it The $L^1$ error is also twice the Total variation distance. I find it somewhat strange to use TV here, usually Wasserstein or MMD are used, but I guess this is okay for toy problems.}

{\bf Reply.} We chose the $L^1$ error as our primary performance measure because it is a bounded metric. This allows us to interpret the error consistently for different probability distributions across simulation setups. To address your concern, we also report the errors (for the same simulation setups) in terms of the $L^2$ distance and the Wasserstein distance in Appendix J, Table 6 of the revised version. The $L^2$ and Wasserstein errors exhibit patterns similar to those of the $L^1$ error, although their magnitudes are more difficult to interpret. %In particular $L^1$ error is highly intuitive and able to be understood. We also remark in addition that $L^p$ errors are standardized among neural operator and physics-informed deep learning literature, and so we found a metric of this variety suitable for us as well.
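
For reference, the boundedness follows from the standard identity below, stated for two probability densities $p$ and $q$ with corresponding measures $P$ and $Q$ (symbols introduced here only for illustration):
\[
\|p - q\|_{L^1} = \int |p(x) - q(x)| \, dx = 2\, \mathrm{TV}(p, q), \qquad 0 \le \|p - q\|_{L^1} \le 2,
\]
where $\mathrm{TV}(p,q) = \sup_{A} |P(A) - Q(A)|$ is the total variation distance. The value $0$ is attained only when $p = q$ almost everywhere and the value $2$ only when $p$ and $q$ have disjoint supports, which gives the $L^1$ error a common scale across simulation setups.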

\medskip

{\it Zero shot super resolution can be done by many modern methods. I’m not sure this is a useful experiment without significantly more experimentation and comparison. Perhaps on benchmarks that are constructed with a known map? (see [6]).}

{\bf Reply.} We highlight that zero-shot super-resolution is a consequence of the operator learning (function-to-function) nature of our method, and a feature that is absent from many existing learning-based [2, 3, 5, 6] and (more traditional) optimization-based Wasserstein geodesic methods. As such, we do not claim state-of-the-art performance on super-resolution problems in computer graphics. Given the limited time available for the rebuttal, we do not intend to pursue this direction.

%and was not an actual experiment in which error or anything of the like was computed. We also remark that the zero-shot super resolution would presumably achieve similar errors to those we found in the paper if the higher-resolution geodesics were known. Furthermore, while zero-shot super resolution has become more frequent in general literature, it is not a feature of traditional numerical solvers and is specific to our method.

\medskip

{\it “x-axis is the log of grid length in one dimension. This is somewhat confusing (also which log base?). Can this be replaced by the actual grid length?}

{\bf Reply.} The logarithm is the natural logarithm (base $e$). We chose a log-log plot for our runtime comparison because it displays the trend when runtimes differ by orders of magnitude, as exhibited by the linear patterns. Following your suggestion, we added Figure 7 in Section 4.5 of the revision (replacing Figure 7 in Section 4.4 of the initial submission), which plots runtime against discretization length on both non-log and log-log scales.
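
To spell out the linear pattern, suppose (as an idealization; the symbols $T$, $n$, $C$ and $\alpha$ are illustrative and are not fitted values) that the runtime $T$ scales with the discretization length $n$ as a power law $T(n) \approx C n^{\alpha}$. Then
\[
\log T(n) \approx \log C + \alpha \log n,
\]
so each method appears as a straight line with slope $\alpha$ and intercept $\log C$ on a log-log plot. Changing the base of the logarithm rescales both axes by the same factor and leaves the slope unchanged.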

%A runtime comparison without logs taken would perhaps be more easily understood, but does not display the trend we were trying to convey.

\medskip

{\it While I can see the usefulness of amortizing W2 computation for faster inference, I do not think the comparison in 4.4 is fair. I would be interested to know for a comparable accuracy, how fast POT and GeONet are, as I assume the POT solver is extremely accurate. Or, similar to shown in MetaOT, a comparison showing GeONet provides a better initialization and speeds up convergence of Sinkhorn-based solvers.}

{\bf Reply.} Thank you for your suggestion! In the revision, we replaced Figure 7 in Section 4.4 with Figure 6 in Section 4.5, which compares the GeONet inference time with two versions of POT: one solves the Wasserstein geodesic problem to machine precision, and the other solves the same problem with early stopping at the GeONet precision level. The left two panels of Figure 6 (revised version) display runtime against discretization length (not on the log scale), and the right panel displays log-runtime against log-discretization-length to show the order-of-magnitude differences. The added experiment suggests that the computational gain of GeONet over both versions of POT is more sizable for higher-dimensional and larger-size problems. The only case in which the reduced-accuracy POT beats GeONet is the 1D Gaussian mixture with a coarse discretization, a scenario of limited practical interest.

\medskip

{\it It could be helpful to include a comparison to non-amortized methods to make clear under what circumstances amortization becomes beneficial.}

{\bf Reply.} We addressed this point in our reply to the earlier question on comparison with non-amortized methods. Specifically, in the revision we added a comparison with two of the suggested learning-based methods: rectified flow (RF) [2] and conditional flow matching (CFM) [3].

\medskip

{\it The method suffers from the curse of dimensionality as it currently requires fixed sized grids as input. It would be interesting to consider more general input forms.}

{\bf Reply.} Our method suffers from the curse of dimensionality only in the {\bf input} to the branch network, and not in the {\bf output} of the trunk network. This suggests that inference time is not affected in this way by the fineness of the grid on which the output is evaluated. Moreover, the input suffers from the curse of dimensionality only partially: it could be mitigated by using alternative architectures, such as convolutional neural networks, for the branch network.
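
Schematically, assuming a DeepONet-style branch-trunk form (a sketch; the symbols $b_k$, $\tau_k$, $p$, $m$ and the grid points $x_1, \dots, x_m$ are notation introduced here for illustration), the prediction at a query point $(x, t)$ can be written as
\[
\mathcal{G}_\theta(\mu_0, \mu_1)(x, t) \approx \sum_{k=1}^{p} b_k\big(\mu_0(x_1), \dots, \mu_0(x_m), \mu_1(x_1), \dots, \mu_1(x_m)\big)\, \tau_k(x, t).
\]
Under this form, the branch input has dimension $2m$, where $m$ is the number of grid points on which the endpoint densities are sampled and grows exponentially with the spatial dimension $d$ for a fixed number of points per axis, whereas the trunk input $(x, t)$ has dimension only $d + 1$ regardless of how fine the evaluation grid is.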

\medskip

[2] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. {\it ICLR} 2023.

[3] Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, Yoshua Bengio. Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport. 2023.

[4] Solomon et al. (2015) Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. {\it ACM Trans. Graph.}

[5] Julien Lacombe, Julie Digne, Nicolas Courty, and Nicolas Bonneel. Learning to generate Wasserstein barycenters, {\it Journal of Mathematical Imaging and Vision} 2021.

[6] Brandon Amos, Giulia Luise, Samuel Cohen, and Ievgen Redko. Meta Optimal Transport, {\it ICML} 2023.

\section*{Reply to Reviewer Q478}

{\it Basically the topic and presentation look excellent, however it lacks of theoretical analysis for example whether the parametric method is appropriate given its complexity of the dynamic geodesic.}
%{Another weak point is it lacks of more examples in application. See my point in questions.}

{\bf Reply.} We acknowledge that this paper does not address statistical guarantees. We expect that some theoretical work can be done under suitable assumptions, given the well-known regularity of the (static and dynamic) OT problems. For instance, if the initial and terminal distributions are smooth, then Caffarelli's global regularity theory ensures that the static OT map between the two distributions is also smooth (to a lesser degree). Then, under reasonable assumptions on the convexity and smoothness of the optimal potential that pushes mass along the Wasserstein geodesic, we expect a stability relation between the GeONet loss function and the underlying Wasserstein geodesic, leading to a generalization error bound that controls the predictive risk of the geodesic at test time. Given the experimental nature of this paper, which introduces a new view of the Wasserstein geodesic from an operator learning perspective, we leave the theoretical analysis to future work.

\medskip

{\it The design is based on the assumption that we fully know the endpoint densities, however in real-world applications, instead of density, we only have a set of samples for each density. How does the approach cope with such cases?}

{\bf Reply.} We have since addressed this by introducing a new experiment on empirical densities constructed from point clouds sampled from Gaussian mixtures. The primary purpose of this experiment is to compare with other methods, notably rectified flow (RF) [2] and conditional flow matching (CFM) [3]; however, it also shows that GeONet is suitable for a point cloud setting, provided the point clouds are converted into empirical densities. The process can be reversed by sampling points according to the generated geodesic densities. While GeONet is not a framework for learning the direct translocation of points, we hope this experiment is satisfactory in illustrating its effectiveness in the discrete setting.
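
As an illustration of the conversion (a sketch assuming a histogram estimator on a uniform grid; the symbols $X_1, \dots, X_N$, $B_j$ and $|B_j|$ are introduced here only for this example), a point cloud $X_1, \dots, X_N$ can be turned into a piecewise-constant empirical density by counting the points in each grid cell $B_j$ of volume $|B_j|$:
\[
\hat{\mu}(x) = \frac{1}{N\, |B_j|} \sum_{i=1}^{N} \mathbf{1}\{X_i \in B_j\} \quad \text{for } x \in B_j,
\]
which is nonnegative and integrates to one when the cells cover all sample points. Conversely, points can be drawn from a predicted density by first choosing a cell $B_j$ with probability equal to its mass and then sampling uniformly within that cell.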

\medskip

{\it It is not clear to me how the endpoint densities are defined in MNIST dataset experiment. I thank you provide the code for the experiment on Gaussian Mixture densities.}

{\bf Reply.} Thank you for your appreciation of our code. We use an autoencoder to encode the entire MNIST dataset: each $28 \times 28$ image is mapped to a $32$-dimensional vector, and a decoder subsequently maps the encoded representation back to the original image. We place no restrictions on the encoding, so the encoded representations are naturally highly irregular. These encoded representations are shifted and normalized to form densities, and the geodesic is then learned between these encoded densities. We revisited this experiment for this rebuttal and ensured that the encoded representations were made into densities, which lowered the error by several percentage points at all times; in addition, the full MNIST dataset is now used for training. In the revision, we also included the source code for the full MNIST experiments in the supplementary material so that the discretization of the endpoint densities in the MNIST experiment can be reproduced.

\section*{Reply to Reviewer oWAG}

{\it The application of neural networks to compute Wasserstein geodesics is interesting. The accuracy is still a big issue compared to classical mesh-based approaches. See
Fu, et.al., High order computation of optimal transport, mean field planning, and potential mean field games. Journal of computational physics, 2023.}

{\bf Reply.} We agree that mesh-based methods are more accurate than learning-based methods. In general, amortized inference relies on reasonable assumptions about the data and the model (such as no distribution shift at test time and sufficient model capacity). Mesh-based approaches can solve the Wasserstein geodesic problem (and other mean-field control/game/planning problems) accurately, down to machine precision (i.e., optimization error), while learning-based methods can only solve it up to statistical error (coming from the data) and approximation error (coming from the model). On the other hand, learning-based methods are often faster (and sometimes much faster) than mesh-based methods on higher-dimensional and larger-size problems.

\medskip

{\it Some more interesting computational examples in mean field control problems can be considered in future work. See
Lin, et.al. Alternating the Population and Control Neural Networks to Solve High-Dimensional Stochastic Mean-Field Games, PNAS.}

{\bf Reply.} Thank you for pointing out this interesting connection to mean-field control problems. Mean-field control problems are closely related to the dynamical OT formulation: both can be formulated as convex-concave primal-dual saddle-point learning (and optimization) problems. There are, however, two main differences. First, mean-field control takes an initial density and a terminal state as boundary conditions, whereas the Wasserstein geodesic requires two fixed endpoint densities as boundary conditions. Second, the dynamical OT problem lacks the interaction term, which is important in mean-field control. Mean-field planning thus appears to be an intermediate problem, which we believe can be studied as an operator learning problem in a manner similar to learning the Wasserstein geodesic. However, learning the dynamics of a general mean-field control problem is likely harder than the dynamical OT problem, and we leave it to future work. We include this point in the {\it Limitations} section of the revision.
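
For concreteness, the dynamical OT problem can be written in the Benamou--Brenier form (a sketch in notation introduced here: $v$ denotes the velocity field and $\mathcal{F}$ a generic interaction cost):
\[
\begin{aligned}
W_2^2(\mu_0, \mu_1) = \min_{(\mu, v)} \; & \int_0^1 \! \int |v(x,t)|^2 \, \mu(x,t) \, dx \, dt \\
\text{subject to} \; & \partial_t \mu + \nabla \cdot (\mu v) = 0, \quad \mu(\cdot, 0) = \mu_0, \quad \mu(\cdot, 1) = \mu_1.
\end{aligned}
\]
A mean-field planning problem keeps the same continuity equation and the two endpoint constraints but adds an interaction term such as $\int_0^1 \mathcal{F}(\mu(\cdot, t)) \, dt$ to the objective, which is absent in dynamical OT.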



\end{document}