Beyond RGB: Very High Resolution Urban Remote Sensing With Multimodal Deep Networks


Nicolas Audebert (ONERA, The French Aerospace Lab)

Abstract

  • Investigates various methods for semantic labeling of very high resolution multimodal remote sensing data
  • Contributions: 1. Multiscale approach to leverage both large spatial context and HR data
    2. Study of early- and late-stage LiDAR fusion. 3. Validation on public datasets
  • Findings: Late fusion – recovers errors from ambiguous data. Early fusion – better joint feature learning

Introduction

  • Adapting vision-based deep networks to multispectral/LiDAR data is not trivial, because these modalities have a different data structure than RGB images.
  • Data – LiDAR and multispectral imagery over urban areas.
  • Contributions: 
    • 1. Implement deep FCNs using SegNet and ResNet
    • 2. Investigate early fusion based on FuseNet principle. 
    • 3. Investigate late fusion based on residual correction strategy. 

Related Work

  • FCN Improvements:
    • Convolutional auto-encoders with symmetrical architectures (SegNet, DeconvNet) – 1:1 resolution, because they upsample back to the original input resolution
    • Removing the pooling layers from a standard CNN and using dilated convolutions to preserve most of the information – multiscale DeepLab makes predictions at several resolutions with separate branches (1:8 resolution)
    • Residual learning in ResNet – 1:8 resolution
  • Multimodal data processing:
    • Dual-stream auto-encoders
    • FuseNet: dual-encoder SegNet variant with an early fusion scheme
  • Patch-based CNNs – produce only coarse maps, since an entire patch is associated with a single label
    • Dense maps can be obtained by sliding a window over the entire input (but this is expensive and slow)
    • Superpixel-based classification – patch-based with an unsupervised pre-segmentation

Method

  • Semantic segmentation of aerial images
    • Base Model – SegNet
    • Encoder of SegNet – Convolutional layers of VGG-16
    • End of encoder – feature maps downsampled to W/32 and H/32
    • Decoder – upsampling and classification 
    • ResNet-34 – 4 residual blocks
    • Softmax layer to compute multinomial logistic loss averaged over the whole patch
  • Multi-scale aspects
    • Addressed using pyramidal approach: different context sizes and different resolutions fed as parallel inputs to one or multiple classifiers. 
  • Early fusion
    • FuseNet: multimodal SegNet – jointly encodes RGB and depth using 2 encoders; the depth activations are summed into the RGB branch after each convolution block
    • Virtual FuseNet: makes the FuseNet fusion symmetrical (with an additional hyperparameter to tune)
  •  Late fusion
    • When inputs are not topologically compatible, early fusion won’t work.
    • Each FCN generates a prediction; we average the 2 predictions to obtain a smooth classification map, and train a correction module in a residual fashion.
  • Class balancing
    • Balance the loss using inverse class frequencies (except for rare classes, which receive the same weight as the lowest weight among all other classes, to avoid over-weighting them)
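The class-balancing rule above can be sketched in plain Python (a minimal sketch; `class_weights` and its arguments are illustrative names, not from the paper):

```python
from collections import Counter

def class_weights(labels, rare=()):
    """Inverse-frequency loss weights; rare classes are clamped to the
    lowest weight found among the remaining classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    weights = {c: total / n for c, n in counts.items()}  # inverse frequency
    # clamp rare classes so they are not over-weighted by inverse frequency
    floor = min(w for c, w in weights.items() if c not in rare)
    for c in rare:
        if c in weights:
            weights[c] = floor
    return weights
```

These weights would then multiply each class term of the multinomial logistic loss.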
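The FuseNet-style early fusion described above can be illustrated structurally in plain Python, with toy lists standing in for feature maps and plain callables standing in for convolution blocks (all names here are illustrative, not from the paper's code):

```python
def fusenet_encode(rgb, aux, rgb_blocks, aux_blocks):
    """Run two parallel encoders; after each block, sum the auxiliary
    (depth/composite) activations into the RGB stream, which carries the
    fused signal forward."""
    for rgb_block, aux_block in zip(rgb_blocks, aux_blocks):
        rgb = rgb_block(rgb)
        aux = aux_block(aux)
        rgb = [r + a for r, a in zip(rgb, aux)]  # element-wise fusion
    return rgb
```

In the real network each block is a VGG-16 convolution+pooling stage and the fusion is an element-wise tensor sum.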

Experiments

  • Datasets
    • ISPRS 2D Semantic Labeling Challenge. – VHR aerial images over 2 cities – 6 classes
      • ISPRS Vaihingen – resolution: 9 cm/pixel, tiles of ~2100×2100 px. 33 tiles, 16 of which have ground truth. IRRG & DSM; a normalized DSM (nDSM) is also provided
      • ISPRS Potsdam – resolution: 5 cm/pixel, tiles of 6000×6000 px. 38 tiles, 24 of which have ground truth. IRRGB & DSM & nDSM
  • Experimental setup
    • Build a composite image composed of DSM, nDSM, & NDVI
    • Sliding-window approach to extract 128×128 patches – the stride of the sliding window defines the size of the overlapping region between 2 consecutive patches.
    • Training time: smaller stride to extract more training samples – acts as data augmentation (64 px stride for Potsdam, 32 px for Vaihingen).
    • Testing time: smaller stride allows us to average the predictions on the overlapping regions, which reduces border effects and improves overall accuracy.
    • SGD with lr=0.01, momentum=0.9, weight decay = 0.0005, batch size = 10
    • SegNet encoder initialized with VGG-16 weights trained on ImageNet.
    • Decoder weights are randomly initialized.
    • Divide the learning rate by 10 after 5, 10, and 15 epochs.
    • For the ResNet-based model, divide the learning rate by 10 after 20 & 40 epochs. In both cases, the learning rate of the pre-initialized weights is set to half that of the new weights.
    • Results are cross-validated on each dataset using a 3-fold split.
    • Final models for testing are re-trained on the whole training set.
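The sliding-window extraction and overlap averaging above can be sketched as follows (hypothetical helper names; square tiles assumed for brevity):

```python
def sliding_windows(size, patch, stride):
    """Top-left offsets of patch×patch windows over a size×size tile;
    the last offset is clamped so the window stays inside the tile."""
    offsets = list(range(0, size - patch + 1, stride))
    if offsets[-1] != size - patch:
        offsets.append(size - patch)
    return [(x, y) for x in offsets for y in offsets]

def coverage_counts(size, patch, stride):
    """How many overlapping predictions each pixel receives; test-time
    predictions are averaged over these counts to reduce border effects."""
    cov = [[0] * size for _ in range(size)]
    for x, y in sliding_windows(size, patch, stride):
        for i in range(x, x + patch):
            for j in range(y, y + patch):
                cov[i][j] += 1
    return cov
```

A smaller stride raises the per-pixel coverage count, which is why test-time accuracy improves at the cost of more forward passes.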
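The step-decay learning-rate schedule above amounts to the following (a sketch; `lr_at` is an illustrative helper, not from the paper's code):

```python
def lr_at(epoch, base=0.01, milestones=(5, 10, 15), gamma=0.1):
    """Step-decay learning rate: multiply by gamma (i.e. divide by 10)
    once each milestone epoch has been reached."""
    return base * gamma ** sum(epoch >= m for m in milestones)
```

With `milestones=(20, 40)` the same helper reproduces the ResNet schedule; halving `base` gives the rate applied to the pre-initialized weights.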
  • Results

Discussion

  • Baselines and preliminary experiments
    • Baseline: train standard SegNet and ResNet on the IRRG and composite versions of the datasets.
    • Compared to SegNet, ResNet is more stable, but requires more memory. 
  • Effects of multiscale strategy
    • Increasing number of branches improves the overall classification, but by a smaller margin each time. 
  • Effects of fusion strategies. 
    • Late fusion mainly manages to recover cars (which were otherwise under-detected)

Link: https://arxiv.org/abs/1711.08681
