ACCEPTED ARTICLE: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, NOVEMBER 2020
CG-Net: Conditional GIS-aware Network for
Individual Building Segmentation in VHR SAR
Images
Yao Sun, Yuansheng Hua, Lichao Mou, and Xiao Xiang Zhu, Senior Member, IEEE
Abstract—Object retrieval and reconstruction from very high
resolution (VHR) synthetic aperture radar (SAR) images are
of great importance for urban SAR applications, yet highly
challenging owing to the complexity of SAR data. This paper
addresses the issue of individual building segmentation from a
single VHR SAR image in large-scale urban areas. To achieve
this, we introduce building footprints from GIS data as com-
plementary information and propose a novel conditional GIS-
aware network (CG-Net). The proposed model learns multi-level
visual features and employs building footprints to normalize the
features for predicting building masks in the SAR image. We
validate our method using a high resolution spotlight TerraSAR-
X image collected over Berlin. Experimental results show that the
proposed CG-Net effectively brings improvements with various
backbones. We further compare two representations of building
footprints, namely complete building footprints and sensor-visible
footprint segments, for our task, and conclude that the use of
the former leads to better segmentation results. Moreover, we
investigate the impact of inaccurate GIS data on our CG-Net,
and this study shows that CG-Net is robust against positioning
errors in GIS data. In addition, we propose an approach of
ground truth generation of buildings from an accurate digital
elevation model (DEM), which can be used to generate large-scale
SAR image datasets. The segmentation results can be applied to
reconstruct 3D building models at level-of-detail (LoD) 1, which
is demonstrated in our experiments.
Index Terms—deep convolutional neural network (CNN), GIS,
individual building segmentation, large-scale urban areas, syn-
thetic aperture radar (SAR)
I. INTRODUCTION
Very high resolution (VHR) synthetic aperture radar
(SAR) imagery has attracted many researchers working on the
modeling and characterization of objects of interest in urban
environments [1]–[7], as it provides data that are independent of
sun illumination and insensitive to weather conditions. Such a
data source is of particular interest to studies concerning
This work is jointly supported by the European Research Council (ERC)
under the European Union’s Horizon 2020 research and innovation programme
(grant agreement No. [ERC-2016-StG-714087], Acronym: So2Sat), by the
Helmholtz Association through the Framework of Helmholtz Artiﬁcial Intel-
ligence Cooperation Unit (HAICU) - Local Unit “Munich Unit @Aeronautics,
Space and Transport (MASTr)” and Helmholtz Excellent Professorship “Data
Science in Earth Observation - Big Data Fusion for Urban Research”, and
by the German Federal Ministry of Education and Research (BMBF) in the
framework of the international future AI lab “AI4EO – Artiﬁcial Intelligence
for Earth Observation: Reasoning, Uncertainties, Ethics and Beyond”.
Y. Sun, Y. Hua, L. Mou, and X. X. Zhu are with the Remote Sensing
Technology Institute, German Aerospace Center, 82234 Wessling, Germany,
and also with the Signal Processing in Earth Observation, Technical
University of Munich, 80333 Munich, Germany. (e-mails: yao.sun@dlr.de;
yuansheng.hua@dlr.de; lichao.mou@dlr.de; xiaoxiang.zhu@dlr.de)
Fig. 1. Illustration of the difference between building semantic segmentation
and individual building segmentation. From left to right: a SAR image, the
result of building semantic segmentation [11], and the result of individual
building segmentation (ours). In the middle image, all buildings are assigned
the same label, while in the right image, each individual building is identiﬁed
as one class.
areas frequently covered by clouds [8] and to applications
of emergency response [9], [10]. However, because of side-
looking imaging geometry and complex backscattering mech-
anism, SAR image interpretation is challenging, especially in
urban areas where severe geometric distortions such as layover
and shadowing further complicate SAR image understanding.
Buildings are the dominant structures in urban regions. The
literature on retrieving information (e.g., footprint and height)
from individual buildings on a large-scale VHR SAR image
is still in its infancy. In [11], [12], buildings are segmented
from large-scale SAR images using deep networks. However,
individual buildings cannot be recognized, due to serious
layover effects on high-rise buildings in urban areas. Fig. 1
shows the difference between building semantic segmentation
results (middle) and our individual building segmentation
results (right) in a SAR image (left). As can be seen, the
latter is capable of not only providing pixel-wise segmentation
masks but also separating building instances. On the other
hand, several works [3]–[5] develop tailored algorithms to
perform accurate analyses of buildings in complex urban
environments, but these methods are difficult to apply to
large-scale areas. In this work, we are interested in individual
building segmentation from SAR images at a large scale. In
what follows, we briefly explain the challenges of this task and
review related work.
arXiv:2011.08362v1  [eess.IV]  17 Nov 2020

A. Challenges
Interpreting individual buildings in SAR images is highly
challenging, mainly for two reasons. First, intensity values in
SAR images are closely related to material types and structural
shapes of objects. Therefore, adjacent buildings in the
physical world are difficult to separate from each other
in a SAR image unless there are obvious material or
structural changes at building boundaries. Second, even if
buildings in the real world are not neighboring, they may
overlap with each other in the SAR image, which significantly
increases the difficulty of image interpretation. Fig. 2 shows
two typical urban areas in an optical image (the ﬁrst column)
and a VHR SAR image (the second column). Footprints and
regions of buildings present in the SAR image are marked
with different colors as shown in the following two columns.
It can be seen that some buildings severely overlap in the
SAR image even if their corresponding footprints are not next
to each other.
B. Related Work
Generally, building extraction approaches from SAR data
can be grouped into the following two categories: data-driven
methods and model-driven methods. The former extracts build-
ing features and then deduces building parameters. Two solu-
tions based on this methodology have been developed. The ﬁrst
one makes an attempt at detecting line- or point-like features
ﬁrst and extracting building regions based on these features.
For example, in [2], feature lines are identiﬁed using a line
detector, and layover areas are derived by extracting parallel
edges; in [13], the authors exploit a constant false alarm rate
(CFAR) edge detector for line feature detection and apply a
Hough transform for parallelogram-like wall area extraction;
in [14], [15], bright line segments and regular spaced point-
like features are detected and subsequently grouped to building
footprints; and in [16], the authors extract and combine a set of
low-level features to create structured primitives. The second
solution directly extracts building regions using segmentation
techniques, such as active contour [17], rotating mask [18],
mean-shift [19], and marker-controlled watershed transform
[20]. In model-driven methods, a SAR image or InSAR
phase is iteratively simulated using geometric and radiometric
hypotheses [3], [4], [21]–[25]. The desired building parameters
are progressively obtained by minimizing the difference
between simulated and real data.
The majority of related studies are carried out on buildings
with speciﬁc geometric shapes, e.g., rectangular- [26]–[28]
or L-shaped footprints [20], [29], ﬂat [30] or gable roofs
[31], [32], and different heights [32]–[35]. Only a few studies
address the problem of complex-shaped buildings [14], [15].
Furthermore, most studies investigate simple scenarios where a
minimal distance between buildings is required to ensure scat-
tering effects of different buildings do not interfere with each
other [3]–[5], [36]. In complex scenarios, possible overlapping
areas between two buildings are usually assigned to one
building [6], [37], which may cause incorrect estimations. By
using a SAR tomography (TomoSAR) point cloud, Shahzad
et al. [38] extract buildings without imposing constraints on
Fig. 2. Two typical urban areas shown in an optical image and a SAR image
(columns from left to right: optical image, SAR image, footprints, buildings).
In columns 3 and 4, footprints and the corresponding building regions in the
SAR image are marked in different colors for reference. rg and az denote the
range direction and azimuth direction, respectively.
building shapes and study scenarios. However, the TomoSAR
technique [39] requires multiple SAR acquisitions that are
generally unavailable for most areas and for applications with
a stringent time limit, such as emergency response.
In addition to SAR data, some auxiliary data are introduced,
e.g., building outlines extracted from optical images [5], [40]
and footprint polygons obtained from GIS data [6], [41],
[42], for providing exact locations and geometric shapes of
buildings in the real world. As illustrated in Fig. 2, in complex
urban regions, the use of footprints is beneﬁcial for tasks
concerning individual buildings in SAR images. In exploiting
the shape information, sensor-visible footprint segments, i.e.,
near-range segments in footprint polygons that correspond to
sensor-visible walls, are desirable for extracting layover areas
[6], [42]; in contrast, complete building footprints may provide
additional information, especially for extracting roof areas of
low-rise buildings [5]. This leaves open the question of how
footprints can be used most effectively. We address this question
in this work by comparing the results obtained with both footprint
representations.
In recent years, deep neural networks have become
increasingly popular and have shown success in remote sensing data
analysis [43]–[52], including a wide range of applications
using SAR data, such as classiﬁcation [53]–[57], segmentation
[58], [59], target recognition [60]–[63], and change detection
[64]–[66]. Instead of relying on hand-crafted features, deep
networks can learn effective feature representations from raw
data in an end-to-end fashion. However, one problem in applying
deep networks to urban SAR analysis tasks is the lack of
annotated data. To address this issue, Wang et al. [67] take
building polygons from the OpenStreetMap (OSM) dataset and
an ofﬁcial map as ground truth data and train a network to
segment buildings in an urban scene. For building footprint

Fig. 3. The workflow for dataset generation. We first collect DEM and GIS data in the UTM coordinate system and then project them to the SAR image
coordinate system in order to generate building ground truth annotations and the corresponding footprints in our study area.
(Diagram stages: data preparation in the UTM coordinate system, with sensor-visible scene modeling of the DEM and building distinction using GIS polygons;
dataset generation in the SAR image coordinate system, with coordinate transformation and mask creation. Outputs: building masks and footprint masks, the
latter as complete building footprints and sensor-visible footprint segments.)
extraction, Shermeyer et al. [68] present a multi-sensor all
weather mapping (MSAW) dataset containing airborne SAR
images, optical images, and building footprint annotations,
along with a deep network baseline model and benchmark.
However, in these two works, building footprints, instead of
building areas, are learning targets. By introducing a To-
moSAR point cloud, Shahzad et al. [12] are able to acquire
accurate building areas in a SAR image and take them as
ground truth annotations to train a segmentation network for
the purpose of building extraction. However, this work cannot
differentiate individual buildings. As our survey of related
work shows above, there is a paucity of literature on using
deep learning for VHR SAR image interpretation in complex
urban areas, particularly aiming at segmenting individual and
overlapping buildings.
C. Contributions
In this work, we intend to segment individual buildings in
a large urban area by exploiting SAR images and building
footprints. For the training of models, we generate pixel-wise
ground truth annotations from an accurate DEM, and building
footprints are acquired from GIS data. We then propose a novel
conditional GIS-aware network (CG-Net) that first learns
multi-level visual features and then employs GIS building
footprint data to normalize these features for predicting final
building masks. In addition, we compare two representations of
building footprints, namely complete building footprints and
sensor-visible footprint segments, aiming to find the more
suitable representation for this task. The main contributions
of this paper are four-fold:
• We propose a workflow for the segmentation of individual
buildings in VHR SAR images with GIS data. To the best of
our knowledge, this is the first time that individual
buildings are studied on a large-scale SAR image and that
deep networks are employed for individual building
segmentation in SAR images.
• We propose a network termed CG-Net, which is
capable of significantly improving the performance of
networks for our task by imposing constraints on the
learning process.
• We investigate the impact of inaccurate GIS data on CG-Net
and find that CG-Net is robust against positioning
errors in GIS data. This study suggests that the large
amount of open-source GIS data can be exploited for
individual building segmentation in SAR images.
• We propose a ground truth generation approach to produce
building masks using an accurate DEM. We believe
that our method has great potential for analyzing
complex urban regions.
The remainder of this paper is organized as follows. The
detailed procedure of the dataset generation is presented in
Section II, and the proposed CG-Net is delineated in Section
III. Section IV introduces the conﬁguration of experiments and
analyzes results. Section V demonstrates an application using
the produced segmentation results. In Section VI, we conclude
this paper.
II. DATASET GENERATION
A. Overview
Building annotations (as ground truth data) and building
footprints (as input data) in SAR images are necessary for
training our network. For this reason, we propose a work-
ﬂow that employs a highly accurate DEM and GIS building
footprints to automatically label building masks and their cor-
responding footprints in SAR images. Our dataset is generated
in two stages. First, sensor-visible 3D building models (i.e.,
non-occluded roofs and facades) and building footprints are
prepared in the UTM coordinate system. Second, they are

Fig. 4. Illustration of scene modeling steps with a simulated DEM in 3D
(first row) and 2D (second row): (left) the DEM point cloud $P_{\mathrm{dem}}$; (middle)
the complete point cloud $P_{\mathrm{com}}$ after adding vertical points; (right) the
sensor-visible point cloud $P_{\mathrm{svs}}$ after hidden point removal.
Fig. 5. Illustration of 2.5D (dark blue) and 3D (green) surface models for
buildings and trees. In the 2.5D representation, each 2D point $(x, y)$ is
assigned a unique height value $z$. Therefore, a 2.5D DEM can represent vertical
walls of buildings, but not vertical surfaces of complex objects, such as trees.
projected to the SAR image coordinate system in order to gen-
erate building ground truth annotations and the corresponding
footprints. Fig. 3 illustrates the workﬂow, and for more details,
refer to the following sections.
B. Data Preparation in the UTM Coordinate System
1) Sensor-visible Scene Modeling: We ﬁrst model a scene
that can be viewed by a radar sensor in the UTM coordinate
system. The procedure is conducted in three steps (cf. Fig. 4):
• The DEM is transformed to a point cloud $P_{\mathrm{dem}}$. Specifically,
each pixel in the DEM with geolocation coordinates
$(x, y)$ and a height value $h$ is represented as a point
with coordinates $(x, y, h)$, and hence all pixels form
a nadir-looking 3D point cloud $P_{\mathrm{dem}}$.
• A complete 3D point cloud $P_{\mathrm{com}}$ is generated by filling
vertical data gaps. More specifically, vertical
structures such as building walls that are absent from
$P_{\mathrm{dem}}$ are added through the following steps. We first
detect building points that are located at height jumps.
Afterwards, at each detected point $g(x, y, h)$, a vertical
point set $G = \{ g_i(x_i, y_i, h_i) \mid i = 1, \dots, m \}$ is added,
where $x_i = x$, $y_i = y$, $h_i = h_0 + i \times h_{\mathrm{step}}$, and $h_i < h_e$.
Here, $h_0$ and $h_e$ are the minimum and maximum heights in the
neighbourhood of $g$, $h_{\mathrm{step}}$ is a predefined height step, and
the number of points is $m = (h_e - h_0)/h_{\mathrm{step}}$. Eventually,
the complete 3D point cloud $P_{\mathrm{com}}$ is built from all vertical
point sets and $P_{\mathrm{dem}}$. Note that the DEM is 2.5D instead of
true 3D, i.e., each 2D point $(x, y)$ is assigned a unique
height value $z$ [69], so that vertical surfaces of complex
objects, such as trees, are not represented (cf. Fig. 5).
Therefore, vertical points are only added to building areas
in this step.
• A sensor-visible scene point cloud $P_{\mathrm{svs}}$ is obtained
through a visibility test on the point cloud $P_{\mathrm{com}}$. Since
Fig. 6. Examples of (top) the visibility test of building footprints and (middle
and bottom) the two footprint representations, i.e., complete building footprints
and sensor-visible footprint segments. (a) and (b) show footprints of
isolated buildings: red edges are sensor-visible, as the angle $\delta$ between the
outward normal vector $\vec{n}$ of an edge and the range direction vector $\vec{r}$ is in
the range of $(90^\circ, 180^\circ]$, while green ones are invisible. (c) shows a case in which
a footprint touches another one, hence common edges are sensor-invisible.
a radar sensor only sees one side of a scene, points on
the other side should be removed. To this end, the hidden
point removal (HPR) algorithm [70] is applied.
In our process, the viewpoint in HPR is positioned on the
line of sight of the radar sensor at a large distance away
from the scene, in order to simulate an orthographic view
in the azimuth of the radar sensor. In this way, sight lines
from the viewpoint to objects in the scene are parallel to
each other and orthogonal to the azimuth, enabling HPR
to remove sensor-invisible points.
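The first two scene modeling steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the 3 × 3 neighbourhood, the `jump` threshold used to detect height jumps, and the choice to add points only on the elevated side of a jump are all assumptions made here. The restriction to building areas and the hidden point removal step [70] are omitted.

```python
import numpy as np

def dem_to_points(dem, pixel_size=1.0):
    """Turn a 2.5D DEM raster into an (N, 3) array of (x, y, h) points."""
    rows, cols = np.indices(dem.shape)
    x = cols.ravel() * pixel_size
    y = rows.ravel() * pixel_size
    return np.column_stack([x, y, dem.ravel().astype(float)])

def add_vertical_points(dem, pixel_size=1.0, h_step=0.5, jump=2.0):
    """Create vertical point sets G at height jumps (assumed detection rule)."""
    n_rows, n_cols = dem.shape
    extra = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Assumed neighbourhood: 3x3 window clipped at the raster border.
            win = dem[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            h0, he = float(win.min()), float(win.max())
            # Assumption: only the elevated pixel of a jump receives points.
            if dem[r, c] != he or he - h0 < jump:
                continue
            m = int((he - h0) / h_step)
            for i in range(1, m + 1):
                hi = h0 + i * h_step
                if hi < he:  # the condition h_i < h_e from the text
                    extra.append((c * pixel_size, r * pixel_size, hi))
    return np.array(extra, dtype=float).reshape(-1, 3)

def complete_point_cloud(dem, **kwargs):
    """P_com: the DEM points plus all vertical point sets."""
    return np.vstack([dem_to_points(dem), add_vertical_points(dem, **kwargs)])
```

With a toy DEM containing a single elevated pixel, `complete_point_cloud` returns the original raster points plus a short column of points below the elevated pixel, which is the gap that a nadir-looking DEM leaves at walls.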
2) Building Distinction: In this step, we distinguish building
points (see footnote 1) for individual buildings. Given one building,
its building points are selected from $P_{\mathrm{svs}}$ using its footprint. Note
that there are two possible inconsistencies between the DEM
and GIS data. First, if a building is contained in $P_{\mathrm{svs}}$ but
not in GIS data, it is not selected from $P_{\mathrm{svs}}$. Second, if a
building is contained in GIS data but not in $P_{\mathrm{svs}}$, i.e., points in
the footprint region are not elevated above the surrounding ground
points, we exclude this building from our dataset.
C. Dataset Generation in the SAR Image Coordinate System
1) Coordinate Transformation: The aforementioned pro-
cedures are carried out in the UTM coordinate system, and
in our case, building points generated in the previous steps
should be projected to the SAR image coordinate system;
that is to say, coordinates (x, y, h) need to be transformed
to (range, azimuth). Moreover, building footprints are also
1 Building points refer to points in a point cloud that belong to the building
class.

Fig. 7. Overview of the CG-Net architecture. (Diagram labels: SAR image; Block1 to Block5;
sensor-visible footprint segments; Latent Space3 to Latent Space5 with γ3 to γ5 and β3 to β5;
three classifiers; concatenation; predicted buildings.)
projected to this coordinate system by using ground height values
obtained from the DEM. Generally, the coordinate transformation
of the point cloud from the UTM coordinate system to
the SAR image coordinate system involves iteratively solving the
Doppler-range-ellipsoid equations, which can be implemented
with different approaches [71]–[74]. In this work, radar coding
was performed using DLR's Integrated Wide Area Processor
(IWAP) [75].
2) Mask Creation: Finally, according to the range-azimuth
coordinates of building points, we generate building ground
truth masks, in which buildings are indicated by 1 and
background is marked as 0. In addition, building footprint
masks in the SAR image coordinate system are also created.
Notably, in order to find an effective way of using building
footprints, we create two representations, namely complete
building footprints and sensor-visible footprint segments. The
latter is generated via a visibility test (see Fig. 6). Formally, let
$\vec{n}$ be the outward normal vector of a polygon edge, $\vec{r}$ be the
range direction vector, and $\delta \in [0^\circ, 180^\circ]$ be the angle between
$\vec{n}$ and $\vec{r}$. A polygon edge is sensor-visible if $\delta \in (90^\circ, 180^\circ]$.
If a footprint touches other footprints, the common edges
are invisible because they do not exist in the real world (e.g.,
Fig. 6(c)).
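The visibility test on a single footprint can be illustrated as follows. This is a sketch under stated assumptions: the function name is hypothetical, the polygon vertices are assumed to be given in counter-clockwise order (so that the outward normal of an edge with direction (dx, dy) is (dy, -dx)), and the rule for edges shared with neighbouring footprints is not handled. Because the angle δ lies in (90°, 180°] exactly when the dot product of the normal and the range direction is negative, the angle itself does not need to be computed.

```python
import numpy as np

def visible_edges(polygon, range_dir):
    """Return one flag per edge: True if the edge is sensor-visible.

    polygon   : (N, 2) vertices in counter-clockwise order (assumption).
    range_dir : 2D vector pointing from near range to far range.
    """
    pts = np.asarray(polygon, dtype=float)
    r = np.asarray(range_dir, dtype=float)
    flags = []
    for i in range(len(pts)):
        a, b = pts[i], pts[(i + 1) % len(pts)]
        d = b - a
        n = np.array([d[1], -d[0]])  # outward normal for a CCW polygon
        # delta in (90 deg, 180 deg]  <=>  n . r < 0
        flags.append(bool(np.dot(n, r) < 0.0))
    return flags

# Unit square, sensor on the left (range increases along +x):
# only the left, near-range edge faces the sensor.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(visible_edges(square, (1, 0)))
```

Edges whose normal is exactly perpendicular to the range direction (δ = 90°) are treated as invisible, in line with the open interval bound in the definition above.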
D. Post-processing
Since the SAR image and the DEM are collected at different
times, there might be inconsistencies resulting from urban
changes, such as building construction and demolition.
This leads to inaccurate ground truth data. We cope with this
problem using the intensity values of the given SAR image. In
the SAR image, intensity values are generally larger in
building areas than in ground areas. Therefore, a threshold is set
to the mode of the intensity values of the SAR image, and
buildings whose mean intensity values are smaller
than the threshold are excluded.
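A minimal sketch of this filtering step is given below. The function names are hypothetical, and estimating the mode of continuous intensity values from a histogram (with an arbitrary number of bins) is an assumption made for illustration; the text does not state how the mode is computed.

```python
import numpy as np

def intensity_mode(image, bins=256):
    """Approximate the mode of the intensities as the centre of the fullest histogram bin."""
    counts, edges = np.histogram(image, bins=bins)
    k = int(np.argmax(counts))
    return 0.5 * (edges[k] + edges[k + 1])

def keep_bright_buildings(image, masks, bins=256):
    """Keep the building masks whose mean intensity is not below the mode threshold."""
    threshold = intensity_mode(image, bins)
    # Each mask is a binary array of the image size with at least one building pixel.
    return [m for m in masks if image[m > 0].mean() >= threshold]
```

A building that was demolished between the DEM and SAR acquisitions would typically show ground-level intensities inside its mask and would therefore be dropped by `keep_bright_buildings`.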
III. METHODOLOGY
A. Overview
In this work, our goal is to train a network that takes a
SAR image and building footprint as inputs and predicts the
building area associated with the footprint in the SAR image.
Since footprints and visible segments generated from GIS data
can provide precise geometry and location information, we
resort to exploiting such cues in our task and devise a network
module that performs a conditional GIS-aware normalization.
By utilizing the CG module, our network, termed CG-Net,
can learn feature representations not only from SAR data but
also from GIS data. Specifically, we employ VGG-16 [76] as the
backbone of CG-Net to learn multi-level features from SAR
images. Afterwards, outputs of the last three convolutional
blocks are upsampled and fed into the CG module separately.
Meanwhile, footprints or visible segments are imported into
the CG module as complementary inputs in order to yield
ﬁnal predictions. In what follows, Section III-B illustrates
the procedure of multi-level feature extraction. Section III-C
introduces details of our CG module, and Section III-D details
the conﬁguration of our CG-Net.
B. Multi-level Feature Extraction Module
We make use of VGG-16 [76] as the backbone of our
network to extract features from multiple layers, as these
multi-level features help in recognizing buildings of varying
scales. The backbone consists of five convolutional blocks, and
each of them contains two or three convolutional layers. The
size of their ﬁlters is 3×3. Outputs of all convolutional layers
are activated by ReLU [77], and 2 × 2 max-pooling layers
with a pooling stride of 2 are interleaved among these blocks.
Features learned from deep layers are considered to include
high-level semantics, while those from shallow layers are low-
level. Therefore, in this task, we utilize features learned from
the last three blocks, i.e., Block3, Block4, and Block5 (see
Fig. 7). Afterwards, the extracted features are fed into the CG
module separately.
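To make the multi-level features concrete, the following sketch lists the feature map sizes that the five blocks of a standard VGG-16 produce for a given input size. The block configuration (2, 2, 3, 3, 3 convolutional layers with 64, 128, 256, 512, 512 channels) is that of the standard VGG-16; that the convolutions preserve the spatial size and that features are taken before the pooling layer of each block are assumptions made here.

```python
# Standard VGG-16 blocks: (number of 3x3 conv layers, output channels).
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def block_output_shapes(height, width):
    """Return (H, W, C) of each block's conv output for an H x W input."""
    shapes = []
    for _, channels in VGG16_BLOCKS:
        # Size-preserving 3x3 convolutions are assumed ('same' padding).
        shapes.append((height, width, channels))
        # 2x2 max-pooling with stride 2 halves the spatial size.
        height, width = height // 2, width // 2
    return shapes

# Block3, Block4, and Block5 for a 256 x 256 input patch.
print(block_output_shapes(256, 256)[2:])
```

Under these assumptions, the three feature maps passed to the CG modules have different spatial sizes and channel counts, which is why they have to be brought to a common size before being combined with the footprint mask.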

Fig. 8. Architecture of the proposed CG module. Here, we take the
sensor-visible footprint segments as an example. $\gamma$ and $\beta$ are normalization
parameters learned from the sensor-visible footprint segments and used to
normalize input feature maps with Eq. (3). (Diagram labels: sensor-visible
footprint segments $\mathbf{m}_{gis}$; latent space; $\gamma(\mathbf{m}_{gis})$; $\beta(\mathbf{m}_{gis})$;
feature maps $\mathbf{x}$; normalized feature maps $\hat{\mathbf{x}}$. Legend: Conv 3x3, identity mapping.)
C. Conditional GIS-aware Normalization Module
An intuitive way to make use of GIS data is to simply
concatenate them with SAR images and then feed them to a
vanilla semantic segmentation network, such as a fully
convolutional network (FCN). However, such a method might suffer
from an inefficient use of GIS data and lead to unstructured
predictions (see the third column in Fig. 13). To address
this issue, we propose a conditional GIS-aware
normalization module to distill the geometry information of
individual buildings from GIS data and normalize final
predictions with such information. Formally, let $\mathbf{m}_{gis}$ be the mask
of the complete building footprint or sensor-visible footprint
segments with a spatial size of $W \times H$, and let $\mathbf{x}_b$ denote the feature
maps extracted from the $b$-th convolutional block. The width
and height of $\mathbf{x}_b$ are denoted as $W'$ and $H'$, respectively,
and the number of channels as $C'$. We consider a naive
conditional normalization procedure as follows:
$$\hat{\mathbf{x}}_b = \gamma_b \mathbf{x}_b + \beta_b, \tag{1}$$
where $\gamma_b$ and $\beta_b$ represent a scale factor and a bias,
respectively, and indicate to what extent $\mathbf{x}_b$ should be scaled and
shifted. The normalized $\mathbf{x}_b$ is denoted as $\hat{\mathbf{x}}_b$. A common
way to obtain $\gamma$ and $\beta$ is to compute them from the standard deviation
and mean of $\mathbf{x}_b$. Since $\mathbf{x}_b$ consists of more than one channel,
$\gamma$ and $\beta$ are often computed in a channel-wise manner, and
thus Eq. (1) can be rewritten as
$$\hat{\mathbf{x}}_{b,c} = \gamma_{b,c}(\mathbf{x}_{b,c}) \cdot \mathbf{x}_{b,c} + \beta_{b,c}(\mathbf{x}_{b,c}), \tag{2}$$
where $c$ denotes the $c$-th channel of $\mathbf{x}_b$ and ranges from 1
to $C'$. This equation can be easily extended to batch
normalization [78] by computing the standard deviation and
mean of each $\mathbf{x}_{b,c}$ in a batch.
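As a worked example of how the scale and bias relate to the standard deviation and mean, consider the common case in which each channel is standardized. The symbols $\mu_{b,c}$ and $\sigma_{b,c}$ (channel mean and standard deviation) are introduced here for illustration and do not appear in the original text; learnable affine terms and the small stabilizing constant used in practice are left out.

```latex
\hat{\mathbf{x}}_{b,c}
  = \frac{\mathbf{x}_{b,c} - \mu_{b,c}}{\sigma_{b,c}}
  = \underbrace{\frac{1}{\sigma_{b,c}}}_{\gamma_{b,c}(\mathbf{x}_{b,c})} \cdot \mathbf{x}_{b,c}
    + \underbrace{\left(-\frac{\mu_{b,c}}{\sigma_{b,c}}\right)}_{\beta_{b,c}(\mathbf{x}_{b,c})}
```

In this reading, both parameters are functions of the features themselves, which is the dependence that the conditional formulation replaces with a dependence on the GIS mask.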
In our case, we want to normalize feature representations
learned from SAR images, conditioned on GIS data. Our
insight is that the GIS data imply coarse localization cues, and
their use can guide the network to segment individual buildings
accurately. Therefore, we reformulate Eq. (2) as follows:
$$\hat{x}_{b,c,p,q} = \gamma_{b,c,p,q}(\mathbf{m}_{gis}) \cdot x_{b,c,p,q} + \beta_{b,c,p,q}(\mathbf{m}_{gis}), \tag{3}$$
Fig. 9. Architecture of the final CG module. Before normalization,
the number of channels of the input feature maps is first reduced, and the
spatial size is enlarged to match that of the sensor-visible footprint segments.
(Diagram labels: sensor-visible footprint segments $\mathbf{m}_{gis}$; latent space;
$\gamma(\mathbf{m}_{gis})$; $\beta(\mathbf{m}_{gis})$; feature maps $\mathbf{x}$; normalized feature maps
$\hat{\mathbf{x}}$. Legend: Conv 3x3, identity mapping, upsampling, Conv 1x1.)
where $\gamma_{b,c,p,q}$ and $\beta_{b,c,p,q}$ indicate the scale factor and bias
learned specifically for the pixel located at $(p, q)$ in the $c$-th
channel of $\mathbf{x}_b$. As a consequence, the normalization parameters $\gamma_b$
and $\beta_b$ are formatted as matrices with a size of $W' \times H' \times C'$.
Such a design enjoys the advantage that normalization
parameters are learned in a data-driven manner, and thus these
parameters are expected to be better adapted to $\mathbf{x}_b$. As to
the implementation of Eq. (3), we first project $\mathbf{m}_{gis}$ onto a
latent space through $3 \times 3$ convolutions and then employ two
convolutional layers to learn $\gamma_b$ and $\beta_b$ from the encoded $\mathbf{m}_{gis}$.
Subsequently, the element-wise multiplication of $\gamma_b(\mathbf{m}_{gis})$
and $\mathbf{x}_b$ is performed, and the output is added to $\beta_b(\mathbf{m}_{gis})$
pixel by pixel. Fig. 8 illustrates the architecture of our CG
module.
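The forward pass of such a module can be sketched in plain NumPy as follows. This is an illustration under assumptions, not the authors' TensorFlow implementation: the helper names, the single 3 × 3 convolution used for the latent projection, the ReLU after it, the latent width, and the random initialization are all choices made here. The mask and the feature maps are assumed to already share the same spatial size.

```python
import numpy as np

def conv3x3(x, w, b):
    """Size-preserving 3x3 convolution. x: (H, W, Cin), w: (3, 3, Cin, Cout), b: (Cout,)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out + b

def init_params(c_feat, c_latent, rng):
    """Random weights for the latent projection and the gamma/beta layers."""
    def layer(c_in, c_out):
        return rng.normal(scale=0.1, size=(3, 3, c_in, c_out)), np.zeros(c_out)
    return {"latent": layer(1, c_latent),
            "gamma": layer(c_latent, c_feat),
            "beta": layer(c_latent, c_feat)}

def cg_module(x, m_gis, params):
    """Eq. (3): per-pixel, per-channel scale and bias predicted from the GIS mask."""
    latent = np.maximum(conv3x3(m_gis, *params["latent"]), 0.0)  # ReLU is assumed
    gamma = conv3x3(latent, *params["gamma"])  # same shape as x
    beta = conv3x3(latent, *params["beta"])    # same shape as x
    return gamma * x + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))                       # toy feature maps
m = np.zeros((8, 8, 1)); m[2:6, 3:5, 0] = 1.0        # toy footprint mask
print(cg_module(x, m, init_params(4, 16, rng)).shape)
```

Since γ and β depend only on the mask, the same features are modulated differently for each building footprint, which is what allows one SAR patch to yield a separate prediction per building.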
D. Conﬁguration of CG-Net
In order to fully exploit GIS data at multiple scales, we
append three CG modules to the last three convolutional
blocks of the backbone (see Fig. 7). However, the
spatial and channel dimensions of the extracted multi-level
features are inconsistent with those of complete building
footprints/sensor-visible footprint segments. To address this
issue, we upsample these multi-level feature maps to match
the spatial resolution of $\mathbf{m}_{gis}$ via bilinear interpolation. Note
that doing so would significantly increase the computational
overhead of subsequent operations. Hence, we reduce the
number of feature channels through $1 \times 1$ convolutions and
modify the CG module (see Fig. 9) accordingly. Outputs of
the CG modules are mapped to the number of classes, 2, and
combined via an element-wise addition to produce the final
segmentation results. Fig. 7 illustrates the architecture of the
proposed CG-Net. Furthermore, we note that the proposed CG
module is plug-and-play and is flexible enough
to enhance other semantic segmentation network architectures,
e.g., DeepLabv3. For DeepLabv3, since it already fuses features
from different layers in its architecture, we simply add
our module right before the last layer.
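The channel reduction and upsampling that precede each CG module can be sketched as follows. The function names are hypothetical, and the half-pixel sampling convention of the bilinear interpolation is an assumption, since the text does not specify it. Following the caption of Fig. 9, channels are reduced before the spatial size is enlarged, which keeps the upsampled tensor small.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution as a per-pixel linear map. x: (H, W, Cin), w: (Cin, Cout)."""
    return x @ w + b

def upsample_bilinear(x, out_h, out_w):
    """Bilinear resize of x: (H, W, C) using half-pixel centres (assumed convention)."""
    h, w, _ = x.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def reduce_and_upsample(x, w, b, out_h, out_w):
    """Reduce channels first, then match the spatial size of the GIS mask."""
    return upsample_bilinear(conv1x1(x, w, b), out_h, out_w)
```

For example, a 16 × 16 × 512 feature map reduced to 32 channels and upsampled to 256 × 256 has 16 times fewer values than the same map upsampled with all 512 channels.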
IV. EXPERIMENTS
A. Data Description
Our dataset is based on a TerraSAR-X image acquired in the
high resolution spotlight mode over Berlin with a pixel

Fig. 10. Our study area in the UTM coordinate system, which is the
intersection between the SAR image and the DEM. (Map residue: axis ticks span
3.86 to 3.94 ×10^5 and 5.817 to 5.824 ×10^6; scale bar from 0 to 1 km; outlines
labeled SAR image and DEM.)
spacing (see footnote 2) of 0.871 m in the azimuth direction and 0.455 m
in the slant range direction. The incidence angle of this SAR
image is 36°, and the heading angle is 194.34°. To reduce
the speckle effect, the SAR image was filtered using a nonlocal
InSAR algorithm [79]. Building footprints in the study
area were downloaded from the Berlin 3D-Download Portal (see footnote 3). In
order to yield ground truth annotations, we use a highly
accurate DEM that was obtained via the stereo processing
of aerial images with a resolution of 7 cm/pixel [80]. Fig. 10
illustrates our study region (the intersection area), the SAR
image (yellow rectangle), and the DEM (red rectangle). Notably,
only data covering the study region are used for generating
our dataset.
By using the workflow described in Section II, building
annotations and footprints are generated. Since we want to
explore how GIS data can be effectively used for individual
building segmentation, two versions of footprint masks
are produced, namely complete building footprints and
sensor-visible footprint segments. Our dataset therefore contains a
5736 × 10312 SAR image, two versions of footprint masks,
and ground truths of individual buildings.
B. Training Details
In order to train an effective and robust segmentation
network, we crop the SAR image into patches of 256 × 256
pixels with a stride of 150 pixels. Note that patches
including incomplete footprints or ground truth annotations are
discarded. Consequently, 30056 buildings remain, and
each of them has three patches: a SAR image patch, a footprint
patch, and a ground truth mask. Among all buildings, 19434
are used to train networks, and the others serve as test
samples. Note that training and test regions do not overlap.
2 In SAR images, pixel spacing represents the length one pixel corresponds
to in the real world, while resolution indicates the minimum distance at which
the radar can distinguish two close scatterers.
3 https://www.businesslocationcenter.de/downloadportal/
The network takes one SAR patch and the corresponding GIS
patch for one building as inputs. After predicting masks of
all buildings, overlapping areas are obtained by overlaying all
masks.
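The cropping step can be illustrated with a minimal sketch. It assumes that patches lie on a regular grid starting at the image origin, that the 5736 × 10312 image is given as (height, width), and that a building is kept for a patch only if its bounding box lies fully inside that patch. The function names and the example bounding box are hypothetical and are not taken from our implementation.

```python
def patch_origins(height, width, patch=256, stride=150):
    """Top-left (row, col) corners of all full patches on a regular grid."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def contains(origin, bbox, patch=256):
    """True if bbox = (r0, c0, r1, c1), end-exclusive, lies fully inside the patch."""
    r, c = origin
    r0, c0, r1, c1 = bbox
    return r <= r0 and c <= c0 and r1 <= r + patch and c1 <= c + patch


# Assumption: 5736 is the image height and 10312 the image width.
origins = patch_origins(5736, 10312)

# Hypothetical bounding box covering one building's footprint and annotation.
bbox = (300, 400, 380, 520)

# Patches that show this building completely; all others would be discarded.
kept = [o for o in origins if contains(o, bbox)]
print(len(origins), kept)
```

Because the stride is smaller than the patch size, neighbouring patches overlap, so a building can be fully visible in more than one patch.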
During the training phase, components of the proposed CG-Net are initialized with different strategies. Specifically, the multi-level feature extraction module is initialized with weights pre-trained on ImageNet [81], and all convolutional layers in the CG modules are initialized with a Glorot uniform initializer. The network is implemented in TensorFlow and trained on one NVIDIA Tesla P100 16GB GPU for 155k iterations. During training, all weights are updated through back-propagation, and we select Nesterov Adam [82] as the optimizer. The parameters of this optimizer are set as recommended: ϵ = 1e−08, β1 = 0.9, and β2 = 0.999. The loss is defined as binary cross-entropy, as only two classes are considered in our dataset, i.e., building segments and background. We initialize the learning rate as 2e−3 and reduce it by a factor of √10 once the loss has stopped decreasing for two epochs. Moreover, we use a small batch size of 5 in our experiments.
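The learning-rate rule can be sketched in plain Python as follows. This is only an illustration of a reduce-on-plateau schedule under our reading of the rule (the rate is divided by √10 after two consecutive epochs without a new best loss); the function name and the example loss values are hypothetical, and in practice a framework callback would typically handle this.

```python
import math


def plateau_schedule(losses, lr=2e-3, factor=1 / math.sqrt(10), patience=2):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` whenever the loss has not improved
    on its best value for `patience` consecutive epochs.
    """
    best = math.inf
    stale = 0  # epochs since the last improvement
    rates = []
    for loss in losses:
        rates.append(lr)
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
    return rates


# Hypothetical per-epoch losses: the loss stalls in epochs 2 and 3.
rates = plateau_schedule([0.9, 0.7, 0.7, 0.71, 0.6])
print(rates)
```

In this example the first four epochs run at 2e−3, and the fifth epoch runs at 2e−3/√10.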
C. Quantitative Evaluation
To evaluate the performance of networks, we calculate the F1 score as follows:
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R}, \quad P = \frac{tp}{tp + fp}, \quad R = \frac{tp}{tp + fn}, \tag{4}
\]
where P and R denote the precision and recall, respectively. In addition, the intersection over union (IoU) and overall accuracy (OA) are also calculated for a comprehensive comparison:
\[
\mathrm{IoU} = \frac{tp}{tp + fp + fn}, \quad \mathrm{OA} = \frac{tp + tn}{tp + tn + fp + fn}. \tag{5}
\]
Here, tp, fp, tn, and fn represent pixel-based true positives, false positives, true negatives, and false negatives for buildings, respectively.
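As a minimal sketch of Eqs. (4) and (5), the metrics can be computed from two flattened binary masks as shown below. The function name, the toy masks, and the convention of returning 0 when a denominator is zero are assumptions made for this illustration.

```python
def segmentation_metrics(pred, truth):
    """Pixel-based P, R, F1, IoU, and OA for two flat binary masks (1 = building)."""
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)

    # Assumed convention: a metric is 0 when its denominator is 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    oa = (tp + tn) / (tp + tn + fp + fn)
    return {"P": precision, "R": recall, "F1": f1, "IoU": iou, "OA": oa}


# Toy example with 8 pixels: tp = 2, fp = 1, fn = 1, tn = 4.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
m = segmentation_metrics(pred, truth)
print(m)
```

For the toy masks, P = R = F1 = 2/3, IoU = 0.5, and OA = 0.75.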
In our experiments, we compare four models: FCN, FCN-
CG, DeepLabv3, and DeepLabv3-CG. It is worth mentioning
that FCN and DeepLabv3 are regarded as baselines, and their
inputs are concatenations of SAR patches and their corre-
sponding footprint patches. Both FCN-CG and DeepLabv3-
CG are our proposed networks with different backbones.
Table I reports numerical results of different models on our dataset, where sensor-visible footprint segments are used. A comparison of these results corroborates that the proposed CG module improves the performance of individual building segmentation. Specifically, compared to FCN and DeepLabv3, FCN-CG and DeepLabv3-CG improve the precision by 0.75 and 2.17 percentage points, respectively. Besides, increments of 1.23 and 1.65 percentage points in the mean F1 score and IoU can be observed by comparing FCN-CG with FCN, while improvements of 0.97 and 1.14 percentage points in the same metrics are achieved by introducing the CG module to DeepLabv3.
Fig. 11. Examples of segmentation results using sensor-visible footprint segments (abbreviated as SFS). Columns a–g show different buildings; the rows show the SAR image, SFS, and the results of FCN, FCN-CG, DeepLabv3, and DeepLabv3-CG. Pixel-based true positives, false positives, and false negatives are marked in green, red, and blue, respectively.
Table II presents results of variant models using complete building footprints. We can see that the results are consistent with those using sensor-visible footprint segments. For example, with the CG module, the precision improves by 1.95 and 3.94 percentage points with the FCN and DeepLabv3 backbones, respectively, and the IoU increases by 1.50 and 2.16 percentage points. To summarize, the improvements achieved by FCN-CG and DeepLabv3-CG demonstrate the effectiveness of the proposed CG module, and DeepLabv3-CG achieves the best performance in all four metrics on our dataset. Moreover, we note that all models achieve relatively high OAs, and even the worst model achieves an OA of 83.40%. This is because OA is computed over all pixels, and non-building pixels, which are easily recognized, account for a large proportion.
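The effect of class imbalance on OA can be made concrete with a small worked example. It assumes a hypothetical 256 × 256 patch in which about 1% of the pixels belong to the building; this fraction is chosen for illustration only and is not a statistic of our dataset.

```python
def overall_accuracy(tp, fp, tn, fn):
    """OA as defined in Eq. (5)."""
    return (tp + tn) / (tp + tn + fp + fn)


n_pixels = 256 * 256
n_building = n_pixels // 100  # hypothetical: about 1% building pixels

# A trivial predictor that labels every pixel as background:
# all building pixels become false negatives, all others true negatives.
oa_trivial = overall_accuracy(tp=0, fp=0, tn=n_pixels - n_building, fn=n_building)
iou_trivial = 0 / (0 + 0 + n_building)

print(f"OA  = {oa_trivial:.4f}")
print(f"IoU = {iou_trivial:.4f}")
```

Under this assumption, a predictor that finds no building at all still reaches an OA of about 0.99, whereas its IoU is 0. This is why precision, F1 score, and IoU are more informative than OA for this task.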
D. Qualitative Evaluation
In addition to the quantitative evaluation, we visualize several segmentation results in Figs. 11 and 12. Pixel-based true positives, false positives, and false negatives are presented in green, red, and blue, respectively.
TABLE I
NUMERICAL RESULTS USING SENSOR-VISIBLE FOOTPRINT SEGMENTS. THE HIGHEST VALUES OF DIFFERENT METRICS ARE HIGHLIGHTED IN BOLD.
Model Name      P       F1 score  IoU     OA
FCN             0.6478  0.6808    0.5138  0.8340
FCN-CG          0.6553  0.6931    0.5303  0.9926
DeepLabv3       0.6635  0.6971    0.5351  0.9927
DeepLabv3-CG    0.6852  0.7068    0.5465  0.9928
Fig. 11 shows results of models using sensor-visible footprint segments. We can observe a general improvement in quality from FCN/DeepLabv3 to FCN-CG/DeepLabv3-CG, especially for buildings in columns b, c, and g. For buildings with simple structures (e.g., the building in column a), all models are able to offer satisfactory segmentation results, while for those with complicated shapes (see column e), large under-segmentation areas (cf. red pixels) can be seen in predicted building masks. Besides, the proposed CG module can effectively reduce over-segmentation in final predictions.
Fig. 12. Examples of segmentation results using complete building footprints (abbreviated as CBF). Columns a–g show different buildings; the rows show the SAR image, CBF, and the results of FCN, FCN-CG, DeepLabv3, and DeepLabv3-CG. Pixel-based true positives, false positives, and false negatives are marked in green, red, and blue, respectively.
TABLE II
NUMERICAL RESULTS USING COMPLETE BUILDING FOOTPRINTS. THE HIGHEST VALUES OF DIFFERENT METRICS ARE HIGHLIGHTED IN BOLD.
Model Name      P       F1 score  IoU     OA
FCN             0.7045  0.7242    0.5676  0.9932
FCN-CG          0.7240  0.7362    0.5826  0.9935
DeepLabv3       0.7129  0.7337    0.5794  0.9935
DeepLabv3-CG    0.7523  0.7508    0.6010  0.9937
Fig. 12 presents results of models using complete footprints. They indicate that our CG module can ease both over-segmentation (cf. blue pixels in column b) and under-segmentation (cf. red pixels in column e) problems to a considerable extent. Moreover, the examples in the third row, column f and the fifth row, column f show that the connectivity of segmentation results is disrupted (cf. green pixels), while integrating the CG module alleviates this problem. A similar effect can be seen in columns d and g, where exploiting the CG module enhances the connectivity of predictions. In summary, the proposed CG module effectively improves segmentation results.
E. Comparison of Complete Building Footprints and Sensor-visible Footprint Segments
From Tables I and II, we can see that models trained with complete building footprints surpass those trained with sensor-visible footprint segments. For instance, DeepLabv3-CG trained on complete footprints improves the F1 score and IoU by 4.40 and 5.45 percentage points, respectively, compared to the model trained with sensor-visible segments.

Fig. 13. Examples of segmentation results from different models on two patches, using complete building footprints (abbreviated as CBF) and sensor-visible footprint segments (abbreviated as SFS). For each patch, the first row shows the SAR image, SFS (overlaid on GT), and the results of FCN, FCN-CG, DeepLabv3, and DeepLabv3-CG; the second row shows the GT, CBF (overlaid on GT), and the results of the same four models. CBF and SFS are overlaid on the ground truth (GT) to visualize the difference between building footprints and buildings. Different buildings are plotted in different colors (50% transparency).
Fig. 13 provides segmentation results of two patches using two versions of footprint masks, and different buildings are marked in different colors (50% transparency). Note that individual building masks are predicted separately, and then masks of buildings in the same patch are plotted together to visualize the overlapping areas. Here, patch 1 presents a simple scenario, in which buildings are isolated and show clear signatures in the SAR image. In this case, all models obtain good segmentation results. Patch 2 shows a fairly complicated scene, where two adjacent buildings exist in the center (cf. buildings in cyan and blue), and SAR signatures are unclear. Although all networks can still successfully segment isolated buildings, the two overlapping buildings are not correctly segmented by models trained with sensor-visible footprint segments (see the third row of Fig. 13). This is because the mask of sensor-visible footprint segments for the building on the left contains only one edge, which does not provide adequate information. Moreover, we notice that the overlapping region between these two buildings can only be well identified by models trained with complete building footprints.
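Overlaying separately predicted masks to locate overlapping areas can be sketched as below. The sketch assumes that all masks of one patch are binary and share the same size; the function name and the two toy masks are hypothetical.

```python
def overlap_map(masks):
    """Count, per pixel, how many individual building masks cover it."""
    height, width = len(masks[0]), len(masks[0][0])
    counts = [[0] * width for _ in range(height)]
    for mask in masks:
        for r in range(height):
            for c in range(width):
                counts[r][c] += mask[r][c]
    return counts


# Two hypothetical 3 x 4 masks of neighbouring buildings (1 = building pixel).
mask_a = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0]]
mask_b = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]

counts = overlap_map([mask_a, mask_b])

# Pixels claimed by at least two buildings form the overlapping area.
overlap = [(r, c) for r, row in enumerate(counts)
           for c, n in enumerate(row) if n >= 2]
print(overlap)
```

Here the second column of the first two rows is covered by both masks, which corresponds to the kind of shared region seen between the cyan and blue buildings in patch 2.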
Overall, these results suggest that complete building footprints are better suited to the segmentation of individual buildings than sensor-visible footprint segments. This may be because the former deliver more information, especially for low-rise buildings.
F. Can CG-Net work with inaccurate GIS data?
So far, the building footprints used in our experiments are highly accurate, as they are acquired from official GIS data. However, most openly available GIS data, such as OpenStreetMap (OSM), often contain positioning errors. To test the performance of CG-Net in such cases, we conduct supplementary experiments in which we train our CG-Net with inaccurate building footprints, and discuss the impact of positioning errors in GIS data.
First, we generate inaccurate CBF, termed CBF-E, by injecting positioning errors. As illustrated in Fig. 14, $\vec{e}$ is the added positioning error, and α is the angle between $\vec{e}$ and the range direction. According to the quality assessment study of OSM in [83], the average offset of building footprints is 4.13 m with a standard deviation of 1.71 m. Therefore, we consider the positioning error as a variable whose magnitude is Gaussian distributed, i.e., $|\vec{e}| \sim \mathcal{N}(\mu = 4.13, \sigma^2 = 1.71^2)$. Since the offset may point in different directions, we assume that the direction of $\vec{e}$ is uniformly distributed, i.e., α is uniformly distributed in the range [0°, 360°). For simplicity, let α be discrete: α ∼ DiscreteUniform(0°, 359°). Note that this is the most difficult case, in which all footprints contain positioning errors.
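Sampling one positioning error per footprint can be sketched as follows. The decomposition into range and azimuth components via cos α and sin α, the sign convention of α, and the function name are assumptions of this illustration. Converting the metric offset into a pixel shift additionally requires the pixel spacings and is omitted here. Note also that a Gaussian magnitude can occasionally be negative, which simply flips the direction of the offset; how such draws are treated is not specified above.

```python
import math
import random


def sample_offset(rng, mean=4.13, std=1.71):
    """Draw one positioning error as (range, azimuth) components in metres."""
    magnitude = rng.gauss(mean, std)         # |e| ~ N(4.13, 1.71^2)
    alpha_deg = rng.randrange(360)           # alpha ~ DiscreteUniform(0, 359)
    alpha = math.radians(alpha_deg)
    d_range = magnitude * math.cos(alpha)    # component along the range direction
    d_azimuth = magnitude * math.sin(alpha)  # component along the azimuth direction
    return d_range, d_azimuth, magnitude, alpha_deg


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
offsets = [sample_offset(rng) for _ in range(1000)]

mean_magnitude = sum(o[2] for o in offsets) / len(offsets)
print(f"mean |e| over {len(offsets)} draws: {mean_magnitude:.2f} m")
```

Every footprint receives its own draw, which matches the setting in which all footprints contain positioning errors.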

Fig. 14. Illustration of generating building footprints with positioning errors. Positioning error $\vec{e}$ is added to building footprint CBF, resulting in CBF-E. rg and az denote the range direction and azimuth direction, respectively. α is the angle between $\vec{e}$ and rg.
TABLE III
NUMERICAL RESULTS OF DEEPLABV3-CG TRAINED USING CBF AND CBF-E.
GIS data used for training  P       F1 score  IoU     OA
CBF                         0.7523  0.7508    0.6010  0.9937
CBF-E                       0.7221  0.7146    0.5560  0.9927
Then, we train DeepLabv3-CG using CBF-E and SAR patches, and test the trained network on a clean test set. DeepLabv3-CG is chosen because it performs best among all the networks. The parameter settings of the network remain the same as in the previous experiments, as described in Section IV-B. The results are listed in Table III. As can be seen, compared with the results using CBF, the precision of the network trained on CBF-E decreases by 3.02 percentage points, the F1 score by 3.62 percentage points, and the IoU by 4.50 percentage points. However, it still gives competent segmentation results. For visual comparison, Fig. 15 shows results of DeepLabv3-CG trained with CBF-E and CBF. For the building in column c, DeepLabv3-CG trained with CBF performs much better than that trained with CBF-E. However, the predictions for the buildings in columns a and b are visually very similar. Moreover, we observe that predictions from DeepLabv3-CG trained on CBF-E are satisfactory for most buildings.
The experiments show that, although weakened by positioning errors in GIS data, the proposed CG-Net is robust even in the most difficult case. This finding suggests that the large amount of existing open-source GIS data, such as OSM, can be exploited for segmenting individual buildings in SAR images.
V. FURTHER APPLICATION: RECONSTRUCTION OF LOD1
BUILDING MODELS FROM A SAR IMAGE
Building models can be created at different levels-of-detail (LoD). According to the terminology of CityGML [84], LoD1 models represent buildings as blocks with flat roof structures and can be reconstructed by extruding footprints with building heights. Here, we regard the average roof height as the building height⁴. In this section, we demonstrate the process of reconstructing LoD1 models using our predicted individual building masks.
⁴ http://en.wiki.quality.sig3d.org/index.php/Modeling Guide for 3D Objects - Part 2: Modeling of Buildings (LoD1, LoD2, LoD3)
Fig. 15. Examples of segmentation results of networks trained using complete building footprints (abbreviated as CBF) and networks trained using building footprints with positioning errors (abbreviated as CBF-E). Columns a–c show different buildings; the rows show the SAR image, CBF, DeepLabv3-CG trained on CBF, and DeepLabv3-CG trained on CBF-E. Pixel-based true positives, false positives, and false negatives are marked in green, red, and blue, respectively.
Fig. 16. The projection geometry of two flat-roof buildings in a slant-range SAR image. θ is the incidence angle. h is the building height. l, r, and f denote the length of layover, roof, and footprint areas in a slant-range SAR image, respectively.
Fig. 16 illustrates the projection geometry of two flat-roof buildings in a constant azimuth profile of a SAR image. θ is the incidence angle. l, r, and f denote the length of layover, roof, and footprint areas in the slant-range SAR image, respectively. Notably, the building region in the SAR image contains both the layover and the roof areas. The layover area coincides with the building region when the building height h is large, e.g., the case in Fig. 16 (left), and it is covered by the building region when h is small, e.g., the case in Fig. 16 (right). In both cases, the layover area can be calculated by subtracting the footprint from the building region. Therefore, l is estimated to be the length of the layover area in the slant-range direction, and h can be computed with the following equation:
\[
h = \frac{l}{\cos\theta}. \tag{6}
\]
Fig. 17. Segmentation results in the study area obtained by DeepLabv3-CG. The building segments are plotted translucently in different colors to visualize the layover areas between buildings. rg and az denote the range direction and azimuth direction, respectively.
From the predicted individual building masks (cf. Fig. 17), we calculate building heights with Eq. (6). Afterwards, LoD1 building models are created by extruding building footprints with the obtained heights. Fig. 18 presents example LoD1 models superimposed on the SAR image in the study area. It can be observed that buildings with large l (indicated by yellow arrows) are predicted as high-rise, while those with small l (indicated by red arrows) are reconstructed as low-rise buildings. This is in line with reality. We further evaluate the estimated height against the mean height from the accurate DEM for each building. The mean height error we achieve in the study site is 2.39 m. The histogram of height errors is shown in Fig. 19.
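The height computation of Eq. (6) can be sketched as follows, using the incidence angle of 36° and the slant-range pixel spacing of 0.455 m reported in Section IV-A. Measuring the layover extent as a number of pixels, treating the height error as a mean absolute difference, and all function names and example values are assumptions of this illustration.

```python
import math


def building_height(layover_pixels, incidence_deg=36.0, slant_spacing_m=0.455):
    """Building height from the layover extent in slant range, h = l / cos(theta)."""
    layover_m = layover_pixels * slant_spacing_m  # layover length l in metres
    return layover_m / math.cos(math.radians(incidence_deg))


def mean_abs_error(estimated, reference):
    """Mean absolute difference between estimated and reference heights."""
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)


# Hypothetical building whose layover spans 40 pixels in slant range.
h = building_height(40)
print(f"estimated height: {h:.2f} m")

# Hypothetical estimated and reference heights for two buildings.
err = mean_abs_error([10.0, 20.0], [12.0, 17.0])
print(f"mean absolute height error: {err:.2f} m")
```

With these values, a layover of 40 pixels (18.2 m in slant range) corresponds to a height of roughly 22.5 m.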
VI. CONCLUSION
In this paper, we propose a conditional GIS-aware network (CG-Net) to segment individual buildings from a large-scale VHR SAR image. We also propose an approach for generating ground truth annotations of buildings using a high resolution DEM. The proposed method is evaluated on the Berlin area, using a high resolution spotlight TerraSAR-X image and building footprints obtained from GIS data. Both qualitative and quantitative results demonstrate the effectiveness of the proposed CG module. Compared to competitors, DeepLabv3-CG achieves the best F1 score of 75.08%. In addition, we compare two building footprint representations, namely complete building footprints and sensor-visible footprint segments. Experimental results suggest that the use of complete building footprints leads to better results. Further experiments on training the networks with inaccurate GIS data suggest that CG-Net is robust in the presence of positioning errors in GIS data. Additionally, we demonstrate an application of our results, i.e., LoD1 building model reconstruction. In the future, we are interested in applying the proposed data generation workflow to areas of various urban morphologies and using our CG-Net to reconstruct LoD1 building models from TerraSAR-X and TanDEM-X stripmap images.
Fig. 18. Example LoD1 building models in the study area superimposed on the SAR image. Layover areas of some buildings are visible, as indicated by the yellow and red arrows. Building heights are color-coded.
Fig. 19. Histogram of building height errors in the study area.
ACKNOWLEDGMENT
The authors would like to thank Dr. H. Hirschmüller of
DLR-RM for providing the optical DEM.
REFERENCES
[1] G. Franceschetti, A. Iodice, and D. Riccio, “A canonical problem in
electromagnetic backscattering from buildings,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 40, no. 8, pp. 1787–1801, 2002.
[2] F. Tupin and M. Roux, “Detection of building outlines based on the
fusion of SAR and optical features,” ISPRS Journal of Photogrammetry
and Remote Sensing, vol. 58, no. 1-2, pp. 71–82, 2003.
[3] R. Guida, A. Iodice, and D. Riccio, “Height retrieval of isolated
buildings from single high-resolution SAR images,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 48, no. 7, pp. 2967–2979, 2010.
[4] D. Brunner, G. Lemoine, L. Bruzzone, and H. Greidanus, “Building
height retrieval from VHR SAR imagery based on an iterative simulation
and matching technique,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 48, no. 3, pp. 1487–1504, 2010.
[5] H. Sportouche, F. Tupin, and L. Denise, “Extraction and three-
dimensional reconstruction of isolated buildings in urban scenes from
high-resolution optical and SAR spaceborne images,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 49, no. 10, pp. 3932–3946,
2011.
[6] L. Wen and F. Yamazaki, “Building height detection from high-
resolution TerraSAR-X imagery and GIS data,” in Joint Urban Remote
Sensing Event (JURSE), 2013.
[7] X. X. Zhu and M. Shahzad, “Facade reconstruction using multiview
spaceborne TomoSAR point clouds,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 52, no. 6, pp. 3541–3552, 2014.
[8] B. Huang, Y. Li, X. Han, Y. Cui, W. Li, and R. Li, “Cloud removal from
optical satellite imagery with SAR imagery using sparse representation,”
IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 5, pp. 1046–
1050, 2015.
[9] D. Brunner, G. Lemoine, and L. Bruzzone, “Earthquake damage as-
sessment of buildings using VHR optical and SAR imagery,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 48, no. 5, pp.
2403–2420, 2010.
[10] T.-L. Wang and Y.-Q. Jin, “Postearthquake building damage assess-
ment using multi-mutual information from pre-event optical image and
postevent SAR image,” IEEE Geoscience and Remote Sensing Letters,
vol. 9, no. 3, pp. 452–456, 2012.
[11] Y. Sun, Y. Hua, L. Mou, and X. X. Zhu, “Large-scale building height
estimation from single VHR SAR image using fully convolutional
network and GIS building footprints,” in Joint Urban Remote Sensing
Event (JURSE), 2019.
[12] M. Shahzad, M. Maurer, F. Fraundorfer, Y. Wang, and X. X. Zhu,
“Buildings detection in VHR SAR images using fully convolution neural
networks,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 57, no. 2, pp. 1100–1116, 2019.
[13] F. Xu and Y.-Q. Jin, “Automatic reconstruction of building objects
from multiaspect meter-resolution SAR images,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 45, no. 7, pp. 2336–2353, 2007.
[14] E. Michaelsen, U. Soergel, and U. Thoennessen, “Perceptual grouping
for automatic detection of man-made structures in high-resolution SAR
data,” Pattern Recognition Letters, vol. 27, no. 4, pp. 218–225, 2006.
[15] U. Soergel, E. Michaelsen, A. Thiele, E. Cadario, and U. Thoennessen,
“Stereo analysis of high-resolution SAR images for building height
estimation in cases of orthogonal aspect directions,” ISPRS Journal of
Photogrammetry and Remote Sensing, vol. 64, no. 5, pp. 490–500, 2009.
[16] A. Ferro, D. Brunner, and L. Bruzzone, “Automatic detection and recon-
struction of building radar footprints from single VHR SAR images,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 2,
pp. 935–952, 2013.
[17] R. Hill, C. Moate, and D. Blacknell, “Estimating building dimensions
from synthetic aperture radar image sequences,” IET Radar, Sonar &
Navigation, vol. 2, no. 3, p. 189, 2008.
[18] R. Bolter and F. Leberl, “Shape-from-shadow building reconstruction
from multiple view SAR images,” in 24th Workshop of the Austrian
Association for Pattern Recognition (ÖAGM/AAPR), 2000.
[19] F. Cellier, H. Oriot, and J.-M. Nicolas, “Introduction of the mean
shift algorithm in SAR imagery: Application to shadow extraction
for building reconstruction,” in EARSeL Workshop 3D-Remote Sensing,
2005.
[20] L. Zhao, X. Zhou, and G. Kuang, “Building detection from urban
SAR image using building characteristics and contextual information,”
EURASIP Journal on Advances in Signal Processing, vol. 2013, no. 1,
p. 56, 2013.
[21] H. Sportouche, F. Tupin, and L. Denise, “Building extraction and 3D
reconstruction in urban areas from high-resolution optical and SAR
imagery,” in Joint Urban Remote Sensing Event (JURSE), 2009.
[22] A. Thiele, C. Dubois, E. Cadario, and S. Hinz, “GIS-supported iterative
filtering approach for building height estimation from InSAR data,” in
European Conference on Synthetic Aperture Radar (EUSAR), 2012.
[23] Y. Zhang, X. Sun, A. Thiele, and S. Hinz, “Stochastic geometrical model
and Monte Carlo optimization methods for building reconstruction from
InSAR data,” ISPRS Journal of Photogrammetry and Remote Sensing,
vol. 108, pp. 49–61, 2015.
[24] M. Quartulli and M. Datcu, “Stochastic geometrical modeling for built-
up area understanding from a single SAR intensity image with meter
resolution,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 42, no. 9, pp. 1996–2003, 2004.
[25] R. Guida, A. Iodice, D. Riccio, and U. Stilla, “Model-based interpre-
tation of high-resolution SAR images of buildings,” IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing,
vol. 1, no. 2, pp. 107–119, 2008.
[26] E. Simonetto, H. Oriot, and R. Garello, “Rectangular building extrac-
tion from stereoscopic airborne radar images,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 43, no. 10, pp. 2386–2395, 2005.
[27] Y. Wang, F. Tupin, C. Han, and J.-M. Nicolas, “Building detection
from high resolution PolSAR data by combining region and edge
information,” in IEEE International Geoscience and Remote Sensing
Symposium (IGARSS), 2008.
[28] B. Liu, K. Tang, and J. Liang, “A bottom-up/top-down hybrid algorithm
for model-based building detection in single very high resolution SAR
image,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 6,
pp. 926–930, 2017.
[29] F. Zhang, Y. Shao, X. Zhang, and T. Balz, “Building L-shape footprint
extraction from high resolution SAR image,” in Joint Urban Remote
Sensing Event (JURSE), 2011.
[30] J. D. Wegner, J. R. Ziehn, and U. Soergel, “Combining high-resolution
optical and InSAR features for height estimation of buildings with flat
roofs,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52,
no. 9, pp. 5840–5854, 2014.
[31] A. Thiele, E. Cadario, K. Schulz, and U. Soergel, “Analysis of gable-
roofed building signature in multiaspect InSAR data,” IEEE Geoscience
and Remote Sensing Letters, vol. 7, no. 1, pp. 83–87, 2010.
[32] J. Chen, C. Wang, H. Zhang, F. Wu, B. Zhang, and W. Lei, “Automatic
detection of low-rise gable-roof building from single submeter SAR
images based on local multilevel segmentation,” Remote Sensing, vol. 9,
no. 3, p. 263, 2017.
[33] R. Guo and X. X. Zhu, “High-rise building feature extraction using
high resolution spotlight TanDEM-X data,” in European Conference on
Synthetic Aperture Radar (EUSAR), 2014.
[34] W. Liu, K. Suzuki, and F. Yamazaki, “Height estimation for high-rise
buildings based on InSAR analysis,” in Joint Urban Remote Sensing
Event (JURSE), 2015.

[35] K. Tang, B. Liu, and B. Zou, “High-rise building detection in dense
urban area based on high resolution SAR images,” in IEEE International
Geoscience and Remote Sensing Symposium (IGARSS), 2016.
[36] S. Chen, H. Wang, F. Xu, and Y.-Q. Jin, “Automatic recognition of
isolated buildings on single-aspect SAR image using range detector,”
IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 2, pp. 219–
223, 2015.
[37] P. Lu, K. Du, W. Yu, and H. Feng, “New building signature extraction
method from single very high-resolution synthetic aperture radar images
based on symmetric analysis,” Journal of Applied Remote Sensing,
vol. 9, no. 1, p. 095072, 2015.
[38] M. Shahzad and X. X. Zhu, “Automatic detection and reconstruction
of 2-D/3-D building shapes from spaceborne TomoSAR point clouds,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3,
pp. 1292–1310, 2016.
[39] X. X. Zhu and R. Bamler, “Very high resolution spaceborne SAR
tomography in urban environment,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 48, no. 12, pp. 4296–4308, 2010.
[40] J. D. Wegner, U. Soergel, and A. Thiele, “Building extraction in urban
scenes from high-resolution InSAR data and optical imagery,” in Joint
Urban Remote Sensing Event (JURSE), 2009.
[41] A. Thiele, S. Hinz, and E. Cadario, “Combining GIS and InSAR data
for 3D building reconstruction,” in IEEE International Geoscience and
Remote Sensing Symposium (IGARSS), 2010.
[42] Y. Sun, M. Shahzad, and X. X. Zhu, “Building height estimation in
single SAR image using OSM building footprints,” in Joint Urban
Remote Sensing Event (JURSE), 2017.
[43] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing
data: A technical tutorial on the state of the art,” IEEE Geoscience and
Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
[44] X. X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer,
“Deep learning in remote sensing: A comprehensive review and list of
resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5,
no. 4, pp. 8–36, 2017.
[45] L. Mou, Y. Hua, and X. X. Zhu, “A relation-augmented fully convo-
lutional network for semantic segmentation in aerial scenes,” in IEEE
International Conference on Computer Vision and Pattern Recognition
(CVPR), 2019.
[46] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov, “Deep learning
classification of land cover and crop types using remote sensing data,”
IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5,
pp. 778–782, 2017.
[47] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, “When deep learning
meets metric learning: Remote sensing image scene classification via
learning discriminative CNNs,” IEEE Transactions on Geoscience and
Remote Sensing, vol. 56, no. 5, pp. 2811–2821, 2018.
[48] L. Mou and X. X. Zhu, “IM2HEIGHT: Height estimation from sin-
gle monocular imagery via fully residual convolutional-deconvolutional
network,” arXiv:1802.10249, 2018.
[49] R. Kemker, C. Salvaggio, and C. Kanan, “Algorithms for semantic seg-
mentation of multispectral remote sensing imagery using deep learning,”
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, pp.
60–77, 2018.
[50] N. Audebert, B. Le Saux, and S. Lefèvre, “Beyond RGB: Very high
resolution urban remote sensing with multimodal deep networks,” ISPRS
Journal of Photogrammetry and Remote Sensing, vol. 140, pp. 20–32,
2018.
[51] Y. Hua, L. Mou, and X. X. Zhu, “Relation network for multilabel aerial
image classification,” IEEE Transactions on Geoscience and Remote
Sensing, doi: 10.1109/TGRS.2019.2963364.
[52] L. Mou, Y. Hua, and X. X. Zhu, “Relation matters: Relational
context-aware fully convolutional network for semantic segmentation of
high-resolution aerial images,” IEEE Transactions on Geoscience and Remote
Sensing, doi: 10.1109/TGRS.2020.2979552.
[53] Q. Lv, Y. Dou, X. Niu, J. Xu, J. Xu, and F. Xia, “Urban land use and
land cover classification using remotely sensed SAR data through deep
belief networks,” Journal of Sensors, vol. 2015, pp. 1–10, 2015.
[54] Y. Zhou, H. Wang, F. Xu, and Y.-Q. Jin, “Polarimetric SAR image
classification using deep convolutional neural networks,” IEEE Geoscience
and Remote Sensing Letters, vol. 13, no. 12, pp. 1935–1939, 2016.
[55] Z. Zhang, H. Wang, F. Xu, and Y.-Q. Jin, “Complex-valued
convolutional neural network and its application in polarimetric SAR image
classification,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 55, no. 12, pp. 7177–7188, 2017.
[56] Z. Zhao, L. Jiao, J. Zhao, J. Gu, and J. Zhao, “Discriminant deep
belief network for high-resolution SAR image classification,” Pattern
Recognition, vol. 61, pp. 686–701, 2017.
[57] J. Geng, H. Wang, J. Fan, and X. Ma, “Deep supervised and contractive
neural network for SAR image classification,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 55, no. 4, pp. 2442–2459, 2017.
[58] Y. Duan, F. Liu, L. Jiao, P. Zhao, and L. Zhang, “SAR image
segmentation based on convolutional-wavelet neural network and Markov random
field,” Pattern Recognition, vol. 64, pp. 255–267, 2017.
[59] F. Mohammadimanesh, B. Salehi, M. Mahdianpari, E. Gill, and
M. Molinier, “A new fully convolutional neural network for semantic
segmentation of polarimetric SAR imagery in complex land cover
ecosystem,” ISPRS Journal of Photogrammetry and Remote Sensing,
vol. 151, pp. 223–236, 2019.
[60] S. Chen and H. Wang, “SAR target recognition based on deep learning,”
in International Conference on Data Science and Advanced Analytics
(DSAA), 2014.
[61] J. Ding, B. Chen, H. Liu, and M. Huang, “Convolutional neural network
with data augmentation for SAR target recognition,” IEEE Geoscience
and Remote Sensing Letters, pp. 1–5, 2016.
[62] M. Kang, K. Ji, X. Leng, and Z. Lin, “Contextual region-based convo-
lutional neural network with multilayer fusion for SAR ship detection,”
Remote Sensing, vol. 9, no. 8, p. 860, 2017.
[63] F. Gao, T. Huang, J. Sun, J. Wang, A. Hussain, and E. Yang, “A new
algorithm for SAR image target recognition based on an improved deep
convolutional neural network,” Cognitive Computation, vol. 11, no. 6,
pp. 809–824, 2019.
[64] M. Gong, J. Zhao, J. Liu, Q. Miao, and L. Jiao, “Change detection in
synthetic aperture radar images based on deep neural networks,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 27, no. 1,
pp. 125–138, 2016.
[65] F. Gao, J. Dong, B. Li, and Q. Xu, “Automatic change detection in
synthetic aperture radar images based on PCANet,” IEEE Geoscience
and Remote Sensing Letters, vol. 13, no. 12, pp. 1792–1796, 2016.
[66] J. Geng, X. Ma, X. Zhou, and H. Wang, “Saliency-guided deep neural
networks for SAR image change detection,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 57, no. 10, pp. 7365–7377, 2019.
[67] X. Wang, L. Cavigelli, M. Eggimann, M. Magno, and L. Benini, “HR-
SAR-Net: A deep neural network for urban scene segmentation from
high-resolution SAR data,” arXiv:1912.04441, 2019.
[68] J. Shermeyer, D. Hogan, J. Brown, A. Van Etten, N. Weir, F. Pacifici,
R. Haensch, A. Bastidas, S. Soenen, T. Bacastow et al., “SpaceNet 6:
Multi-sensor all weather mapping dataset,” arXiv:2004.06500, 2020.
[69] R. Weibel, “Digital terrain modelling,” Geographical Information Systems:
Principles and Applications, pp. 269–297, 1991.
[70] S. Katz, A. Tal, and R. Basri, “Direct visibility of point sets,” ACM
Transactions on Graphics, vol. 26, no. 3, p. 24, 2007.
[71] J. C. Curlander, “Location of spaceborne SAR imagery,” IEEE Transactions
on Geoscience and Remote Sensing, vol. GE-20, no. 3, pp. 359–364,
1982.
[72] M. Schwabisch, “A fast and efficient technique for SAR interferogram
geocoding,” in International Geoscience and Remote Sensing Symposium
Proceedings (IGARSS), 1998.
[73] T. Toutin, “Geometric processing of remote sensing images: models,
algorithms and methods,” International Journal of Remote Sensing,
vol. 25, no. 10, pp. 1893–1924, 2004.
[74] A. Roth, M. Huber, and D. Kosmann, “Geocoding of TerraSAR-X data,”
in International Congress of the ISPRS, 2004.
[75] F. R. Gonzalez, N. Adam, A. Parizzi, and R. Brcic, “The Integrated Wide
Area Processor (IWAP): A processor for wide area persistent scatterer
interferometry,” in ESA Living Planet Symposium, 2013.
[76] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in International Conference on
Learning Representations (ICLR), 2015.
[77] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems (NIPS), 2012.
[78] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in International
Conference on Machine Learning (ICML), 2015.
[79] G. Baier, X. X. Zhu, M. Lachaise, H. Breit, and R. Bamler, “Nonlocal
InSAR filtering for DEM generation and addressing the staircasing
effect,” in European Conference on Synthetic Aperture Radar (EUSAR),
2016.
[80] H. Hirschmuller, “Stereo processing by semiglobal matching and mutual
information,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 30, no. 2, pp. 328–341, 2008.
[81] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A large-scale hierarchical image database,” in IEEE International Conference
on Computer Vision and Pattern Recognition (CVPR), 2009.

[82] T. Dozat, “Incorporating Nesterov momentum into Adam,”
http://cs229.stanford.edu/proj2015/054_report.pdf, online.
[83] H. Fan, A. Zipf, Q. Fu, and P. Neis, “Quality assessment for building
footprints data on OpenStreetMap,” International Journal of Geographical
Information Science, vol. 28, no. 4, pp. 700–719, 2014.
[84] T. H. Kolbe, G. Gröger, and L. Plümer, “CityGML: Interoperable access
to 3D city models,” in International Symposium on Geo-Information for
Disaster Management (Gi4DM), 2005.
