Nicolas Audebert (ONERA, The French Aerospace Lab) – Beyond RGB: Very High Resolution Urban Remote Sensing with Multimodal Deep Networks
Abstract
- Investigates various methods for semantic labeling of very high resolution multi-modal remote sensing data
- Contributions: 1. A multi-scale approach to leverage both a large spatial context and the high-resolution data. 2. A study of early- and late-stage LiDAR fusion. 3. Validation on public datasets.
- Findings: late fusion recovers errors from ambiguous data; early fusion enables better joint feature learning.
Introduction
- Adapting vision-based deep networks to multispectral/LiDAR data is not trivial, because these modalities have a different data structure than RGB images.
- Data – LiDAR and multispectral imagery over urban areas.
- Contributions:
- 1. Implement deep FCNs based on SegNet and ResNet.
- 2. Investigate early fusion based on FuseNet principle.
- 3. Investigate late fusion based on residual correction strategy.
Related Work
- FCN Improvements:
- Convolutional auto-encoders with symmetrical architectures (SegNet, DeconvNet) – 1:1 output resolution, since the decoder upsamples back to the original input resolution
- Removing the pooling layers from a standard CNN and using dilated convolutions to preserve most of the spatial information – multi-scale DeepLab makes predictions at several resolutions with separate branches (1:8 resolution); a small dilated-convolution example follows this list
- Residual learning in ResNet – 1:8 resolution
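As a quick illustration of the dilated-convolution point above, here is a tiny PyTorch sketch (my own example, not from the paper): with padding equal to the dilation, a 3×3 dilated convolution enlarges the receptive field without any pooling, so the output keeps the input's spatial size.

```python
import torch
import torch.nn as nn

# Dilated (a trous) convolution: larger receptive field, no pooling.
x = torch.randn(1, 64, 128, 128)
conv = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)  # padding = dilation keeps H and W
print(conv(x).shape)  # torch.Size([1, 64, 128, 128])
```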
- Multimodal data processing:
- Dual stream autoencoders:
- FuseNet: a dual-stream counterpart of SegNet built on an early fusion scheme.
- Patch-based CNNs – produce only coarse maps, as the entire patch is associated with a single label.
- Dense maps can be obtained by sliding a window over the entire input (but this is expensive and slow).
- Superpixel-based classification – patch-based with an unsupervised pre-segmentation.
Method
- Semantic segmentation of aerial images
- Base Model – SegNet
- Encoder of SegNet – Convolutional layers of VGG-16
- End of encoder – W/32 and H/32
- Decoder – upsampling and classification
- ResNet-34 – 4 residual blocks
- Softmax layer to compute the multinomial logistic loss, averaged over the whole patch (a minimal sketch of the base model follows below)
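A minimal PyTorch sketch of the SegNet principle described above (a cut-down toy, not the authors' implementation; the real SegNet mirrors all 13 VGG-16 convolutional layers): the encoder records max-pooling indices, the symmetric decoder unpools with them, and a softmax cross-entropy loss is averaged over the patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniSegNet(nn.Module):
    """Toy SegNet: encoder pools with stored indices, decoder unpools with them."""
    def __init__(self, in_ch=3, n_classes=6):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU())
        self.pool = nn.MaxPool2d(2, 2, return_indices=True)   # remember where the maxima were
        self.unpool = nn.MaxUnpool2d(2, 2)                    # put values back at those positions
        self.dec2 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
        self.dec1 = nn.Conv2d(64, n_classes, 3, padding=1)    # per-pixel class scores

    def forward(self, x):
        x = self.enc1(x); x, idx1 = self.pool(x)
        x = self.enc2(x); x, idx2 = self.pool(x)
        x = self.unpool(x, idx2); x = self.dec2(x)
        x = self.unpool(x, idx1); x = self.dec1(x)
        return x  # logits at the full input resolution (1:1)

# Multinomial logistic loss averaged over the whole patch:
logits = MiniSegNet()(torch.randn(2, 3, 128, 128))
target = torch.randint(0, 6, (2, 128, 128))
loss = F.cross_entropy(logits, target)
```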
- Multi-scale aspects
- Addressed using a pyramidal approach: different context sizes and different resolutions are fed as parallel inputs to one or several classifiers (sketched below).
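A hedged sketch of one way to read this pyramidal approach (names like `make_fcn` are placeholders, not from the paper): downsampled copies of the input go through parallel branches, and the upsampled predictions are averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleWrapper(nn.Module):
    """One dense predictor per scale; fuse the upsampled predictions by averaging."""
    def __init__(self, make_branch, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList([make_branch() for _ in scales])

    def forward(self, x):
        h, w = x.shape[-2:]
        preds = []
        for s, branch in zip(self.scales, self.branches):
            xs = x if s == 1.0 else F.interpolate(x, scale_factor=s, mode='bilinear', align_corners=False)
            p = branch(xs)                                      # smaller scale = larger relative context
            preds.append(F.interpolate(p, size=(h, w), mode='bilinear', align_corners=False))
        return torch.stack(preds).mean(0)                       # average the per-scale predictions

# Any dense predictor works as a branch; a single conv stands in here.
make_fcn = lambda: nn.Conv2d(3, 6, 3, padding=1)
out = MultiScaleWrapper(make_fcn)(torch.randn(2, 3, 128, 128))  # (2, 6, 128, 128)
```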
- Early fusion
- FuseNet: SegNet in a multimodal setting – jointly encodes both RGB and depth using two encoders, whose activations are summed after each convolutional block (see the sketch below).
- Virtual FuseNet (V-FuseNet): makes the FuseNet fusion symmetrical (at the cost of an additional hyperparameter to tune).
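A hedged sketch of the FuseNet fusion principle (widths are simplified; the real encoders mirror VGG-16): the DSM stream is summed into the optical stream after each convolutional block, and a single decoder would then consume the fused features.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU())

class FuseNetEncoder(nn.Module):
    """Early fusion: add the auxiliary (DSM) activations into the main (optical)
    stream after each convolutional block; both streams are then pooled."""
    def __init__(self):
        super().__init__()
        widths = [(3, 64), (64, 128)]   # two blocks only, for brevity
        self.opt = nn.ModuleList([conv_block(*w) for w in widths])
        self.dsm = nn.ModuleList([conv_block(1 if w[0] == 3 else w[0], w[1]) for w in widths])
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, opt, dsm):
        for opt_block, dsm_block in zip(self.opt, self.dsm):
            dsm = dsm_block(dsm)
            opt = opt_block(opt) + dsm             # fusion: element-wise sum
            opt, dsm = self.pool(opt), self.pool(dsm)
        return opt                                  # fused features for a single decoder

feats = FuseNetEncoder()(torch.randn(2, 3, 128, 128), torch.randn(2, 1, 128, 128))
```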
- Late fusion
- When the inputs are not topologically compatible, early fusion is not feasible; late fusion covers this case.
- Each FCN generates a prediction; the two predictions are averaged to obtain a smooth classification map, and a correction module is trained on top in a residual fashion (sketched below).
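One possible instantiation of this late-fusion scheme, sketched from the description above (for simplicity the correction module takes the two score maps as input, which is an assumption on my part): the averaged prediction is refined by a small learned residual.

```python
import torch
import torch.nn as nn

class ResidualCorrection(nn.Module):
    """Late fusion: average the two streams' predictions, then add a learned
    residual correction (trained after the two main FCNs)."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.correct = nn.Sequential(
            nn.Conv2d(2 * n_classes, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_classes, 3, padding=1),
        )

    def forward(self, p_optical, p_lidar):
        avg = (p_optical + p_lidar) / 2                             # smooth averaged map
        res = self.correct(torch.cat([p_optical, p_lidar], dim=1))  # corrective term
        return avg + res                                            # residual refinement

fused = ResidualCorrection()(torch.randn(2, 6, 128, 128), torch.randn(2, 6, 128, 128))
```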
- Class balancing
- Balance the loss using inverse class frequencies, except for rare classes, which receive the same weight as the lowest weight among all other classes (sketched below).
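A small NumPy sketch of this balancing rule (the pixel counts in the usage line are invented for illustration): inverse-frequency weights, with the rare class clamped to the lowest weight among the other classes.

```python
import numpy as np

def balanced_weights(label_counts, rare=(5,)):
    """Inverse-frequency class weights; rare classes are clamped to the smallest
    weight among the remaining classes so they do not dominate the loss."""
    counts = np.asarray(label_counts, dtype=np.float64)
    weights = counts.sum() / counts                    # inverse class frequency
    common = [c for c in range(len(counts)) if c not in rare]
    weights[list(rare)] = weights[common].min()        # cap the rare-class weight
    return weights

# e.g. pixel counts per class, with class 5 (clutter) very rare
w = balanced_weights([4.1e7, 3.6e7, 2.7e7, 9.0e6, 1.7e6, 8.0e4])
```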
Experiments
- Datasets
- ISPRS 2D Semantic Labeling Challenge – VHR aerial images over two cities, 6 classes
- ISPRS Vaihingen – resolution: 9 cm/pixel, 2100×2100 px tiles; 33 tiles, 16 with ground truth; IRRG imagery and DSM, plus a normalized DSM (nDSM)
- ISPRS Potsdam – resolution: 5 cm/pixel, 6000×6000 px tiles; 38 tiles, 24 with ground truth; IRRGB imagery, DSM, and nDSM
- Experimental setup
- Build a composite image composed of the DSM, nDSM, and NDVI
- Sliding window approach to extract 128×128 patches – the stride of the sliding window defines the size of the overlap between two consecutive patches.
- Training time: a smaller stride extracts more training samples, acting as data augmentation (64 px stride for Potsdam, 32 px for Vaihingen).
- Testing time: a smaller stride allows averaging the predictions over the overlapping regions, which reduces border effects and improves overall accuracy (see the sketch below).
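A hedged sketch of this test-time procedure; `predict_patch` is a hypothetical callable returning an (n_classes, 128, 128) score array for one patch.

```python
import numpy as np

def sliding_windows(h, w, size=128, stride=32):
    """Top-left corners of overlapping patches covering an (h, w) tile."""
    xs = list(range(0, h - size + 1, stride))
    ys = list(range(0, w - size + 1, stride))
    if xs[-1] != h - size: xs.append(h - size)   # cover the bottom border
    if ys[-1] != w - size: ys.append(w - size)   # cover the right border
    return [(x, y) for x in xs for y in ys]

def predict_tile(tile, predict_patch, n_classes=6, size=128, stride=32):
    """Average per-pixel class scores over overlapping patches."""
    h, w = tile.shape[-2:]
    scores = np.zeros((n_classes, h, w))
    hits = np.zeros((h, w))
    for x, y in sliding_windows(h, w, size, stride):
        scores[:, x:x+size, y:y+size] += predict_patch(tile[..., x:x+size, y:y+size])
        hits[x:x+size, y:y+size] += 1
    return scores / hits   # averaging reduces border effects
```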
- SGD with lr=0.01, momentum=0.9, weight decay = 0.0005, batch size = 10
- SegNet encoder initialized with VGG-16 weights trained on ImageNet.
- Decoder weights are randomly initialized.
- Divide the learning rate by 10 after 5, 10, and 15 epochs (SegNet).
- For the ResNet-based model, divide the learning rate by 10 after 20 and 40 epochs. In both cases, the learning rate of the pre-initialized weights is set to half the learning rate of the new weights (see the optimizer sketch below).
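A sketch of this optimization setup in PyTorch (the two `Conv2d` modules merely stand in for the pretrained encoder and the fresh decoder): two parameter groups, pretrained weights at half the base learning rate, and the SegNet milestone schedule.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.ModuleDict({
    'encoder': nn.Conv2d(3, 64, 3, padding=1),   # stands in for the pretrained VGG-16 layers
    'decoder': nn.Conv2d(64, 6, 3, padding=1),   # stands in for the randomly initialized decoder
})

base_lr = 0.01
optimizer = optim.SGD(
    [{'params': model['decoder'].parameters(), 'lr': base_lr},
     {'params': model['encoder'].parameters(), 'lr': base_lr / 2}],  # pretrained: half LR
    momentum=0.9, weight_decay=0.0005)

# SegNet schedule: divide the LR by 10 after epochs 5, 10, 15
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10, 15], gamma=0.1)

for epoch in range(20):
    # ... one pass over 128x128 patches, batch size 10 ...
    scheduler.step()
```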
- Results are cross-validated on each dataset using a 3-fold split.
- Final models for testing are re-trained on the whole training set.
- Results
Discussion
- Baselines and preliminary experiments
- Baseline: train standard SegNet and ResNet on the IRRG and composite versions of the datasets.
- Compared to SegNet, ResNet is more stable, but requires more memory.
- Effects of multiscale strategy
- Increasing the number of branches improves the overall classification, but by a smaller margin each time.
- Effects of fusion strategies.
- Late fusion mainly manages to recover cars (which were otherwise poorly detected).
Link: https://arxiv.org/abs/1711.08681