Physics-Aware Difference Graph Networks for Sparsely-Observed Dynamics

It is well known that the central problem of the whole of modern mathematics is the study of transcendental functions defined by differential equations

-Felix Klein

Real-world dynamic systems are often modelled using differential equations. Solving these equations gives insight into the underlying behaviour of the system. The characteristics of such systems are complex and difficult to capture using Ordinary Differential Equations (ODEs) alone, as the behaviour depends on several interacting factors. Partial Differential Equations (PDEs) are therefore used to model the dynamics of complex physical systems. PDEs are used widely in heat-flow studies, stress-strain relationships, electrodynamics, wave equations in acoustics and much more. PDEs can be solved using analytical as well as approximate methods. The work on solving PDEs analytically goes back to the early 19th century, when Joseph Fourier studied the heat conduction equation, which later became the foundation of the Fourier transform.

Not all ODEs and PDEs can be solved in closed form, and numerical methods evolved around this fact. Numerical methods are used by modern computers to solve complex PDEs for which a closed-form solution is difficult to obtain. Numerical or approximate methods for solving differential equations date back to the time of Newton and Euler, when a differential equation was solved by approximating it as a difference equation. Approximating a partial derivative as a difference equation introduces a numerical error, since derivatives are defined on a continuous domain while difference equations are defined on a discrete domain.

In this blog, we review and summarize the work on Physics-Aware Difference Graph Networks (PA-DGN), which tries to discover the underlying dynamical relations between temporal and spatial differences from sparse sequential observations (more on this as you proceed). This work also applies approximate difference operators on graph signals and solves PDEs using sparse data.

Modelling physical systems

Physical systems have an inherent property: connectedness. There is an underlying structure in these systems that can be modelled. We observe a physical system through measurements and try to make sense of its behaviour. These observations or measurements are inter-related, and the underlying structure relating them can be modelled very well using graphs.

An image can be considered as a graph of pixels, a speech sequence as a graph of frames, a process as a graph of states, a social network as a graph of users connected with each other, and so on. Each node can have different properties, like pixel intensities for a pixel-node in an image-graph. Since the nodes are connected, any effect on one node also tends to have an impact on its neighboring nodes. These interactions are very important for modelling the overall behaviour of the physical system.

Fig. 1. A facial image can be considered as a collection of facial muscles interconnected with each other. The influence of one muscle is visible on neighboring muscles as well. (Image source: Imotions)

In real-world scenarios, observations are made using measuring equipment, and we usually lack the resources or apparatus to measure many such phenomena because the setup is expensive. Some examples of such physical phenomena are climate observations, heat flow, diffusion of gases, weather data and many more.

In a weather or meteorological setup we do not have access to a lot of data: it is impossible to measure temperature at every point in the world, so we rely on weather stations that act as representatives of their regions. This means we have to deal with sparse data. PDEs capture the behaviour accurately when the data is assumed to come from a continuous domain. When solving PDEs on sparse data, a numerical error is incurred since derivatives are only defined on a continuous domain. This error, known as discretization error, arises from sampling the continuous function, and it can significantly affect our ability to model the dynamics of the physical system.

As the observations become highly sparse and irregular, it becomes harder to build efficient deep learning models on top of them. Why would we do that? Because with dynamic systems we would like to predict what is going to happen. Predicting the future has always been tempting, and if done accurately it can save us a lot of resources and unnecessary effort. Imagine if we could predict the temperature across a whole region from a few measured locations; then we would know beforehand that installing solar panels in regions with high temperature will be cost effective and efficient.

Coming to the sparse nature of the data: these sparse observations have spatial relations, but they also vary with time (dynamic systems). The temporal relations among these sparse observations can be very helpful in predicting properties in the near as well as the far future.

Fig. 2. Temporal graph, a graph which evolves with time. Here each graph has spatial relations amongst its nodes, and the node properties change with time, creating a temporal relation. (Image source: Blog on Temporal Graph Networks)

To summarize, if we want to understand underlying behaviour of complex dynamic systems we should:

  1. Model the sparse data observations using a graph (nodes, edges).
  2. Approximate the original PDEs, defined on a continuous domain, by difference equations on a discrete domain. This incurs a discretization error.
  3. Solve the PDEs on these graphs defined on sparse data using difference operators. We have to define operators that act on sparse data (analogous to derivatives).
  4. Predict the properties of the system by treating the observations as graph signals and leveraging both spatial and temporal differences.

Graph Signals and Fundamentals


A graph signal is a mapping from the set of nodes or edges to a real vector space. Say we have a graph \(\mathcal G = (\mathbb V, \mathbb E)\) where \(\mathbb V =\{1,2,\dots,N_{\mathbb V}\}\) is the set of vertices or nodes and \(\mathbb E = \{(i,j)\ |\ i,j \in \mathbb V\}\) is the set of edges; then a graph signal can be defined in two ways.

Readers can go through [Shuman et al] for an excellent introduction to graph signal processing.

Graph signals on nodes

Consider the $i^{th}$ node in $\mathbb V$ and let $f_i(t) \in \mathbb R^d$ be its signal value, where $d$ is the dimension of the graph signal. The graph signal on all nodes at time $t$ is then the set of all such $f_i$'s, i.e. a function $f:\mathbb V \rightarrow \mathbb R^d$ with \(f(t) = \{f_i(t)\ |\ i\in\mathbb V\}\).

Fig. 3. Sample graph signal defined on nodes; here $\mathbb R^N$ can also be written as $\mathbb R^{N\times d}$ to represent a multidimensional graph signal on nodes.
(Image source: Blog on GSP)

Graph signals on edges

Similarly, graph signals can also be defined on edges: \(F(t) = \{F_{ij}(t)\ |\ (i,j)\in\mathbb E\}\), where \(F:\mathbb E\rightarrow \mathbb R^p\) and $p$ is the dimension of the multi-dimensional graph signal on edges.

Gradient on graphs

The gradient of a graph signal on nodes (i.e., a function on the nodes of a graph) is represented by a finite difference. It is a map from the Hilbert space of functions on nodes to the Hilbert space of functions on edges, so the finite differences of a node signal become edge features.

\[\nabla: L^2(\mathbb V) \rightarrow L^2(\mathbb E), \qquad (\nabla f)_{ij} = f_j - f_i \quad \forall (i,j) \in \mathbb E, \text{ and } 0 \text{ otherwise}\]
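To make the gradient operator concrete, here is a minimal NumPy sketch (our own toy example, not from the paper); `edges` and `f` are hypothetical names for the edge list and node signal:

```python
import numpy as np

# Toy graph: 4 nodes on a cycle, directed edge list (i, j)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
f = np.array([1.0, 3.0, 2.0, 0.5])  # scalar graph signal on nodes

# Graph gradient: maps a node signal to an edge signal, (∇f)_ij = f_j - f_i
grad_f = np.array([f[j] - f[i] for (i, j) in edges])
print(grad_f)  # edge features: [ 2.  -1.  -1.5  0.5]
```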

Readers are advised to go through [Crane K] for an introduction to discrete differential geometry, which involves the study of difference operators defined on meshes and graphs.

Laplace-Beltrami Operator on graphs

The Laplacian in the graph domain is a map from the Hilbert space of node functions to another Hilbert space of node functions. In matrix form it becomes the graph Laplacian $L = D - A$, where $D$ is the diagonal degree matrix and $A$ is the adjacency matrix.

\[\Delta: L^2(\mathbb V) \rightarrow L^2(\mathbb V), \qquad (\Delta f)_i = \displaystyle\sum_{j:(i,j)\in\mathbb E}(f_i - f_j) \quad \forall i \in \mathbb V\]
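A small sketch (again our own, not the authors' code) showing that multiplying a node signal by $L = D - A$ yields exactly the neighbor-difference sum above:

```python
import numpy as np

# Undirected toy graph on 4 nodes (a cycle), adjacency matrix A
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # graph Laplacian

f = np.array([1.0, 3.0, 2.0, 0.5])  # node signal
lap_f = L @ f                       # (Δf)_i = Σ_{j ∈ N(i)} (f_i - f_j)
print(lap_f)                        # [-1.5  3.   0.5 -2. ]
```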

Difference operator on Triangulated Mesh

The operator on a triangulated mesh considers not only the gradient and Laplacian but also the geometry of the surface, through the angles of the triangles meeting at each edge. The Finite Element Method (FEM) discretization is given below:

\[(\Delta f)_i = \frac{1}{2} \displaystyle \sum_{j:(i,j)\in\mathbb E}(\cot\alpha_j + \cot\beta_j)(f_j - f_i)\]

where node $j$ belongs to $i$'s immediate neighbors $(j\in\mathbb N_i)$ and $(\alpha_j,\beta_j)$ are the two angles opposite the edge $(i,j)$ in the two triangles sharing it.
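For illustration only, a sketch of how the cotangent weight of a single edge could be computed from the two triangles sharing it, assuming the vertex positions are available (the function and variable names are our own):

```python
import numpy as np

def cot_weight(p_i, p_j, p_alpha, p_beta):
    """Cotangent weight of edge (i, j): cot(alpha) + cot(beta), where alpha and
    beta are the angles opposite the edge in the two triangles sharing it."""
    def cot_angle(apex, a, b):
        u, v = a - apex, b - apex
        return np.dot(u, v) / np.linalg.norm(np.cross(u, v))
    return cot_angle(p_alpha, p_i, p_j) + cot_angle(p_beta, p_i, p_j)

# Mesh Laplacian at node i: (Δf)_i = 0.5 * Σ_j (cot α_j + cot β_j) * (f_j - f_i)
```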

Previous Works

A common assumption in these works: the data is densely distributed, i.e., the input effectively lies on a continuous domain. If the data is sparse, we cannot consider the input to be on a continuous domain.

  1. Physics-Informed Neural Networks (PINN) [Raissi et al.] learn a nonlinear mapping from the input $(x,t)$, where $x$ denotes the spatial coordinates, to the output simulated by a given PDE.

  2. Previous works like [Chen et al.] also incorporate prior domain knowledge into a data-driven approach. These methods cannot capture the overall continuous nature of the domain, since in real-world scenarios we only have a limited number of observation points.

  3. Some works like [Long et al.] also try to learn the underlying PDEs, but these approaches are suited to regular grid structures only.

  4. Data-first approaches for regular grid data have proved efficient at dealing with sparsity, but they completely ignore the physical equations governing the dynamic systems.

Difference equations are very important in physical systems, as differences between quantities hold useful information about the system (e.g., in the Navier-Stokes equations, traffic flow, and so on). In images, weighted differences are used to detect edges.

An example of a system of coupled non-linear difference equations is given below:

\[y[n] = nx[n] + \cos(x[n]) \\ y[n+1] = e^{y[n-1]}\]

Contributions of this work:

  1. A Spatial Difference Layer (SDL) based on GNNs, to exploit neighboring information in sparse data points (learning localized spatial features from sparse data). GNNs are used as they can leverage the structural information.
  2. A Recurrent Graph Network (RGN) layer after the SDL, to learn the temporal differences.
  3. PA-DGN as an efficient method to approximate directional derivatives and predict graph signals on synthetic data, along with experiments predicting climate observations from weather stations, where PA-DGN outperforms other baselines.

Network Architecture

To exploit neighboring information and learn finite differences inspired by physics equations, a novel architecture, PA-DGN, is proposed. PA-DGN leverages data-driven end-to-end learning to discover the underlying dynamical relations between spatial and temporal differences in the given sequential observations.

PA-DGN consists of two modules, a Spatial Difference Layer (SDL) and a Recurrent Graph Network (RGN). Let us look at them one by one.

Fig. 4. PA-DGN architecture with SDL and RGN modules. (Image source: Seo et al.)

Spatial Difference Layer (SDL)

As the data is sparse, spatial features computed by relying only on the raw values of neighboring nodes will be inaccurate. To overcome this, the SDL aims to learn a difference operator, a combination of the gradient and the Laplacian operator on the graph signal, that makes better use of the neighboring information. The learnable difference operators have the following form:

\[{(^w\nabla f)_{ij} = w_{ij}^{(g_1)}(f_j - w_{ij}^{(g_2)}f_i)}\] \[{(^w\Delta f)_{i}} = {\displaystyle \sum_{j:(i,j)\in \mathbb E} w_{ij}^{(l_1)}(f_i - w_{ij}^{(l_2)}f_j)}\]

The parameters $w_{ij}$ tune the difference operators along the corresponding edge direction ${\textbf e_{ij}}$; the superscripts $(g_1, g_2, l_1, l_2)$ distinguish the weights used in the gradient and Laplacian terms. These weights are produced by an aggregation function over a local neighborhood, which is the core of any GNN-based architecture:

\[w_{ij} = g(\{f_k, F_{mn}\space|\space k, (m,n) \in h-\text{hop neighborhood of edge} \space e_{ij}\})\]

The $w_{ij}$'s are the edge-feature outputs of the aggregation function $g$. This combination of gradient and Laplacian operators results in a general difference operator that can be used in various other scenarios such as sharpening, edge detection, and modulating gradients.
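Below is a minimal PyTorch sketch of these learnable difference operators. It is our own simplification, not the authors' implementation: the per-edge weights here come from a small MLP over the two endpoint features rather than a full $h$-hop GNN aggregation.

```python
import torch
import torch.nn as nn

class SpatialDifferenceLayer(nn.Module):
    """Learnable gradient/Laplacian-style difference operators on a graph."""
    def __init__(self, in_dim, hidden_dim=32):
        super().__init__()
        # Predicts the 4 modulation weights (g1, g2, l1, l2) for every edge.
        self.weight_net = nn.Sequential(
            nn.Linear(2 * in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4))

    def forward(self, f, edge_index):
        # f: (N, d) node signal, edge_index: (2, E) source/target node ids
        src, dst = edge_index
        w = self.weight_net(torch.cat([f[src], f[dst]], dim=-1))  # (E, 4)
        w_g1, w_g2, w_l1, w_l2 = [w[:, k:k+1] for k in range(4)]

        # Modulated gradient: (∇_w f)_ij = w_g1 * (f_j - w_g2 * f_i), an edge signal
        grad = w_g1 * (f[dst] - w_g2 * f[src])

        # Modulated Laplacian: (Δ_w f)_i = Σ_j w_l1 * (f_i - w_l2 * f_j), a node signal
        lap_terms = w_l1 * (f[src] - w_l2 * f[dst])
        lap = torch.zeros_like(f).index_add_(0, src, lap_terms)
        return grad, lap

# Usage on a toy graph
f = torch.randn(5, 3)                                   # 5 nodes, 3 features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]]) # 4 directed edges
sdl = SpatialDifferenceLayer(in_dim=3)
grad, lap = sdl(f, edge_index)
print(grad.shape, lap.shape)  # torch.Size([4, 3]) torch.Size([5, 3])
```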

Recurrent Graph Networks

After obtaining the spatial differences from the SDL output, in the form of the modulated gradient and Laplacian, the modulated graph signal is concatenated with the original graph signal to construct node-wise and edge-wise features. The resulting graph is called the difference graph; it contains information about the graph signal at time $t$ and can be used to predict the graph signal at the next time step. To include both spatial and temporal variations, the authors use a Recurrent Graph Network layer which takes as input a graph state ${\mathcal G_{h}}$, node features $z_i$ and edge features $z_{ij}$. The output is the updated graph state \({\mathcal {G}_{h}^{*}} = (\mathit{h^{*(v)},h^{*(e)})}\), together with the next node features \(z_{i}^{*}\) and edge features \(z_{ij}^*\).

Update rules:

\[{(z_{ij}^*, \mathit{h^{*(e)})}\leftarrow \phi^e(z_{ij},z_i,z_j,h^{(e)})}\space\forall\space(i,j)\in\mathbb{E}\] \[{(z_{i}^*, \mathit{h^{*(v)})}\leftarrow \phi^v(z_{i},\bar z_i',h^{(v)})}\space\forall\space i\in\mathbb{V}\]

Here $\bar z_i'$ is the aggregated edge attribute of node $i$, and $\phi^e,\phi^v$ are the edge and node update functions. The target is predicted through a decoder that takes \(z_{i}^*\) and \(z_{ij}^*\) as inputs.
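A rough PyTorch sketch of one RGN step, under our own simplifications (GRU cells play the role of $\phi^e$ and $\phi^v$, and incident edge states are mean-aggregated into nodes; the actual model uses Graph Network blocks as in [Sanchez-Gonzalez et al]):

```python
import torch
import torch.nn as nn

class RecurrentGraphNetwork(nn.Module):
    """One recurrent step over a difference graph: edge update, aggregate, node update."""
    def __init__(self, node_dim, edge_dim, hidden_dim=64):
        super().__init__()
        self.edge_cell = nn.GRUCell(edge_dim + 2 * node_dim, hidden_dim)  # φ^e
        self.node_cell = nn.GRUCell(node_dim + hidden_dim, hidden_dim)    # φ^v
        self.decoder = nn.Linear(hidden_dim, node_dim)                    # readout

    def forward(self, z_node, z_edge, edge_index, h_edge, h_node):
        src, dst = edge_index
        # Edge update: φ^e(z_ij, z_i, z_j, h^(e))
        edge_in = torch.cat([z_edge, z_node[src], z_node[dst]], dim=-1)
        h_edge = self.edge_cell(edge_in, h_edge)

        # Aggregate updated edge states into their source nodes (mean)
        agg = torch.zeros(z_node.size(0), h_edge.size(1)).index_add_(0, src, h_edge)
        deg = torch.zeros(z_node.size(0)).index_add_(0, src, torch.ones(src.size(0)))
        agg = agg / deg.clamp(min=1).unsqueeze(-1)

        # Node update: φ^v(z_i, aggregated edges, h^(v)) and decode the prediction
        h_node = self.node_cell(torch.cat([z_node, agg], dim=-1), h_node)
        return self.decoder(h_node), h_edge, h_node

# Toy usage
N, E = 5, 4
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
rgn = RecurrentGraphNetwork(node_dim=3, edge_dim=3)
out, h_e, h_n = rgn(torch.randn(N, 3), torch.randn(E, 3),
                    edge_index, torch.zeros(E, 64), torch.zeros(N, 64))
```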

Objective Function

The loss function is the squared error over all node and edge signals. The following objective function is minimized:

\[\mathcal L = \displaystyle\sum_{i \in \mathbb V}\Vert f_i-\hat f_i\Vert^2 + \displaystyle \sum_{(i,j)\in \mathbb E}\Vert F_{ij}-\hat F_{ij}\Vert^2\]
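In code, this is simply the sum of two squared-error terms over node and edge predictions (a sketch with hypothetical tensor names):

```python
import torch

def padgn_loss(f_pred, f_true, F_pred, F_true):
    """Squared loss over node signals plus squared loss over edge signals."""
    node_loss = ((f_true - f_pred) ** 2).sum()
    edge_loss = ((F_true - F_pred) ** 2).sum()
    return node_loss + edge_loss
```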

Effectiveness of SDL

Approximation of directional derivatives

The following experiment was carried out to approximate directional derivatives.

Experiment: Two synthetic functions were considered, each with 200 sampled points. The underlying graph was created using k-nearest neighbors with k = 4. With the gradient known at each point, the ground-truth directional derivative can be computed by projecting ${\nabla f}$ onto the connected edge $e_{ij}$. A total of 4 baselines were considered for this task and compared against SDL.

Fig. 5. Gradients and graph structure of sampled points from the synthetic functions. The left one is $0.1x^2+0.5y^2$ and the right one is $\sin (x)+\cos(y)$. (Image source: Seo et al.)
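A quick sketch (our own notation) of how the ground-truth directional derivative along an edge is obtained by projecting the analytic gradient onto the edge direction:

```python
import numpy as np

def directional_derivative(grad_f_i, x_i, x_j):
    """Project the gradient at node i onto the unit vector pointing from x_i to x_j."""
    e_ij = (x_j - x_i) / np.linalg.norm(x_j - x_i)
    return np.dot(grad_f_i, e_ij)

# Example with f(x, y) = 0.1 x^2 + 0.5 y^2, whose gradient is (0.2 x, 1.0 y)
x_i, x_j = np.array([1.0, 2.0]), np.array([1.5, 2.5])
grad_f_i = np.array([0.2 * x_i[0], 1.0 * x_i[1]])
print(directional_derivative(grad_f_i, x_i, x_j))
```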


| Method | Mathematical expression | Remarks |
|---|---|---|
| FinGrad | \(\frac{f_j-f_i}{\Vert x_j-x_i \Vert}\) | No learnable parameters, fixed two-point approach |
| MLP | Input: \((f_i,f_j,x_i, x_j)\) | Input same as FinGrad, but parameters are learnable |
| GN | Edge features: \(d(x_i,x_j)\); node features: \(f_i, f_j\) | Edge feature output is used as the prediction of the directional derivative |
| One-w | \({(^w\nabla f)_{ij} = w_{ij}f_j - f_i}\) | Not very robust, as it doesn't capture all possible combinations of $f_i$ and $f_j$ |
| SDL | \({(^w\nabla f)_{ij} = w_{ij}^{(g_1)}(f_j - w_{ij}^{(g_2)}f_i)}\), \({(^w\Delta f)_{i} = \sum_{j:(i,j)\in \mathbb E} w_{ij}^{(l_1)}(f_i - w_{ij}^{(l_2)}f_j)}\) | Better than all baselines |

Predicting synthetic graph signals

Experiment: PA-DGN is applied to synthetic data sampled from a simulation of a convection-diffusion equation, to check the prediction performance for simulated dynamics observed only at discrete nodes. The equation used is:

\[\frac{df_i(t)}{dt} = a(i)(\nabla f)_{\hat x}+b(i)(\nabla f)_{\hat y}+c(i)\Delta f, \qquad f_i(0) = f_0(i)\]

The \(i^{th}\) node has coordinates \((x_i,y_i)\) in the 2D domain $[0,2\pi]\times[0,2\pi]$, and $\hat x, \hat y$ are the coordinate directions.

\[a(i) = 0.5(\cos(y_i)+x_i(2\pi-x_i)\sin(x_i))+0.6\] \[b(i) = 2(\cos(y_i)+ \sin(x_i))+0.8\] \[c(i) = 0.5(1-\frac{\sqrt{(x_i-\pi)^2+(y_i-\pi)^2}}{\sqrt{2}\pi})\]

A total of 250 sample points are chosen uniformly. Using these equations, it is possible to predict the graph signal values of all points for M future steps, given the previous N steps. N = 5 and M = 15 are chosen for this experiment, and the graph is created using k-nearest neighbors (k = 4). As the equation is a linear PDE, the SDL layer is cascaded with a linear regression model at the prediction end.
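A sketch of how such synthetic data could be generated, using the coefficient definitions above and a naive forward-Euler rollout (our own simplification; the paper's simulation details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 250
xy = rng.uniform(0, 2 * np.pi, size=(N, 2))   # sampled node coordinates
x, y = xy[:, 0], xy[:, 1]

# Spatially varying convection (a, b) and diffusion (c) coefficients
a = 0.5 * (np.cos(y) + x * (2 * np.pi - x) * np.sin(x)) + 0.6
b = 2.0 * (np.cos(y) + np.sin(x)) + 0.8
c = 0.5 * (1 - np.sqrt((x - np.pi) ** 2 + (y - np.pi) ** 2) / (np.sqrt(2) * np.pi))

def euler_step(f, grad_x, grad_y, lap, dt=1e-3):
    """One forward-Euler step of df/dt = a * df/dx + b * df/dy + c * Δf,
    given estimates of the spatial derivatives at each node."""
    return f + dt * (a * grad_x + b * grad_y + c * lap)
```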

| Model | Architecture | Remarks |
|---|---|---|
| VAR | 2 lags, input concatenated from the previous 2 frames | Weights are shared among all nodes |
| MLP | 2 hidden layers, input concatenated from the previous 2 frames | Weights are shared among all nodes |
| RGN | 2 GN blocks, one edge-update block and one node-update block | Both blocks use 2-cell GRUs with hidden dimension 73 |
| StandardOP | Uses the standard difference operators on graphs | Hidden dimension = 73 |
| MeshOP | Uses the triangulated mesh operator on graphs | Hidden dimension = 73 |

StandardOP, MeshOP and SDL outperform the remaining methods, which shows that spatial difference information is crucial for prediction.

Prediction: Graph Signals on Land-based weather sensors

The following experiment was carried out using 2 datasets.

Experiment: Predicting temperature from land-based weather stations located in the USA

Data: Weather station groups from the western and southeastern states of the USA were sampled from the Online Climate Data Directory of the National Oceanic and Atmospheric Administration (NOAA), on the basis of how actively they recorded meteorological observations during 2015. A k-NN graph with k = 4 is created, and the resulting adjacency matrix is transformed via $A \rightarrow (A+A^T)/2$ to make it symmetric. One year of hourly data was aggregated from all the stations and split as follows: 8 months for training, 2 months for validation and 2 months for testing. The loss function $\mathcal L$ is minimized using the Adam optimizer with scheduled sampling [Bengio et al]. Two types of predictions are made, 1-step and multi-step. Four baselines are used: VAR, MLP, GRU and RGN [Sanchez-Gonzalez et al].
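A short sketch (our own, assuming a `coords` array of station coordinates) of the k-NN graph construction and symmetrization described above:

```python
import numpy as np

def knn_adjacency(coords, k=4):
    """Build a k-nearest-neighbor adjacency matrix and symmetrize it."""
    n = coords.shape[0]
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)             # exclude self-loops
    A = np.zeros((n, n))
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest stations
    for i in range(n):
        A[i, nearest[i]] = 1.0
    return (A + A.T) / 2                        # symmetrize, as in the paper

coords = np.random.rand(30, 2)  # dummy station coordinates
A = knn_adjacency(coords, k=4)
```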

Fig. 6. Weather stations from the western and south-eastern states of the USA. (Image source: Seo et al.)

PA-DGN outperforms RGN because the spatial and temporal differences help capture the predicted graph signal accurately: the signal depends not only on its own previous values but also on the neighborhood and its signal values. This leads to a better understanding of the underlying physics of the dynamic system.

Implementation Tricks

  • NVIDIA GTX 1080Ti GPUs were used to run the experiments
  • $h=2$ (the $h$-hop neighborhood size used in the SDL weight aggregation)
  • Each experiment was run 3 times and the standard deviation was computed
  • The synthetic-data model was trained on 5 input frames to predict 15 frames
  • The NOAA-dataset model was trained on 12 input frames to predict 12 frames
  • Hyperparameters (see the optimizer sketch after this list): learning rate $10^{-3}$, batch size 8, weight decay $5\times 10^{-4}$, 2000 epochs
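For reference, a minimal sketch of this training configuration wired into PyTorch's Adam optimizer (the model, data and loss here are stand-ins, not the authors' code):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)  # stand-in for a PA-DGN-style model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

# Dummy data just to make the loop runnable; replace with the real dataloader.
data = [(torch.randn(8, 8), torch.randn(8, 1)) for _ in range(4)]  # batch size 8

for epoch in range(2000):
    for inputs, targets in data:
        optimizer.zero_grad()
        loss = ((model(inputs) - targets) ** 2).sum()  # squared loss, as above
        loss.backward()
        optimizer.step()
```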

Ablation Study

RGN without spatial derivatives is first evaluated on the graph signals from the datasets. RGN is then augmented with StandardOP (discrete spatial differences from the gradient and Laplacian) and evaluated, and likewise with MeshOP (triangulated-mesh approximations of the differential operators) as separate input signals. The final model evaluated is PA-DGN, which combines RGN with SDL. PA-DGN shows the lowest mean absolute errors (3.56%, 5.50%, 8.51% and 8.73%, 8.32%, 5.49% on the two datasets, respectively).

Takeaways from the blog

  • Real-world systems can be modelled using graphs. Sparse data is a challenging frontier when it comes to solving PDEs, which cannot be applied to it directly.
  • PDEs can be approximated as difference equations, at the cost of a discretization error.
  • To address both the sparsity and the approximation problem of PDEs, the work proposes PA-DGN, which learns the underlying dynamics using spatial and temporal differences.
  • Spatial differences were found to be very informative in the various experiments carried out.
  • PA-DGN was found to be superior in approximating directional derivatives and predicting graph signals, both on synthetic data and on real-world climate observations from weather stations.
  • Operators with learnable parameters perform better than fixed approximated operators.

References

[1]. Seo et al., “Physics-aware Difference Graph Networks for Sparsely-Observed Dynamics”, ICLR 2020.

[2]. Stanković, L., & Sejdić, E. (Eds.). (2019). Vertex-frequency analysis of graph signals. Springer International Publishing.

[3]. Shuman et al., “The Emerging Field of Signal Processing on Graphs”, IEEE Signal Processing Magazine.

[4]. Raissi et al., "Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations", arXiv preprint

[5]. Chen et al. “On learning optimized reaction diffusion processes for effective image restoration”, CVPR 2015

[6]. Long et al., "PDE-Net: Learning PDEs from Data", ICML 2018

[7]. Crane K, “Discrete differential geometry: An applied introduction”, Notices of the AMS, Communication, 2018

[8]. Bengio et al., “Scheduled sampling for sequence prediction with recurrent neural networks”, NIPS 2015

[9]. Sanchez-Gonzalez et al., “Graph networks as learnable physics engines for inference and control”, ICML 2018

[10]. Sandryhaila et al., “Discrete Signal Processing on Graphs”, IEEE Transactions on Signal Processing, 2013