Beyond RGB: Very High Resolution Urban Remote Sensing With Multimodal Deep Networks


Nicolas Audebert (ONERA, The French Aerospace Lab)

Abstract

  • Investigates various methods for semantic labeling of very high resolution multimodal remote sensing data
  • Contributions: 1. Multiscale approach to leverage both large spatial context and HR data
    2. Study of early- and late-stage LiDAR fusion. 3. Validation on public datasets
  • Findings: Late fusion – recovers errors from ambiguous data. Early fusion – better joint feature learning

Introduction

  • Adapting vision-based deep networks to multispectral/LiDAR data is not trivial, because these modalities have a different data structure than RGB images.
  • Data – LiDAR and multispectral imagery over urban areas.
  • Contributions: 
    • 1. Implement deep FCNs using SegNet and ResNet
    • 2. Investigate early fusion based on FuseNet principle. 
    • 3. Investigate late fusion based on residual correction strategy. 

Related Work

  • FCN Improvements:
    • Convolutional auto-encoders with symmetrical architectures (SegNet, DeconvNet) – 1:1 resolution, because they upsample back to the original input resolution
    • Removing the pooling layers from a standard CNN and using dilated convolutions to preserve most of the information – multiscale DeepLab makes predictions at several resolutions with separate branches (1:8 resolution)
    • Residual learning in ResNet – 1:8 resolution
  • Multimodal data processing:
    • Dual-stream auto-encoders
    • FuseNet: dual-encoder SegNet variant with an early fusion scheme
  • Patch-based CNNs – produce only coarse maps, since an entire patch is associated with a single label
    • Dense maps can be obtained by sliding a window over the entire input (but this is expensive and slow)
    • Superpixel-based classification – patch-based with an unsupervised pre-segmentation

Method

  • Semantic segmentation of aerial images
    • Base Model – SegNet
    • Encoder of SegNet – Convolutional layers of VGG-16
    • End of encoder – feature maps downsampled to W/32 and H/32
    • Decoder – upsampling and classification 
    • ResNet-34 – 4 residual blocks
    • Softmax layer to compute multinomial logistic loss averaged over the whole patch
  • Multi-scale aspects
    • Addressed using pyramidal approach: different context sizes and different resolutions fed as parallel inputs to one or multiple classifiers. 
  • Early fusion
    • FuseNet: multimodal SegNet – jointly encodes RGB and depth using 2 encoders; the depth activations are summed into the RGB branch after each convolution block
    • Virtual FuseNet: makes the FuseNet fusion symmetrical (with an additional hyperparameter to tune)
  •  Late fusion
    • When inputs are not topologically compatible, early fusion won’t work.
    • Each FCN generates a prediction; we average the 2 predictions to obtain a smooth classification map, and train a correction module in a residual fashion.
  • Class balancing
    • Balance the loss using inverse class frequencies (except for rare classes, which receive the same weight as the lowest weight among all other classes, to avoid over-weighting them)
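The class-balancing rule above can be sketched in plain Python (a minimal sketch; `class_weights` and its arguments are illustrative names, not from the paper):

```python
from collections import Counter

def class_weights(labels, rare=()):
    """Inverse-frequency loss weights; rare classes are clamped to the
    lowest weight found among the remaining classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    weights = {c: total / n for c, n in counts.items()}  # inverse frequency
    # clamp rare classes so they are not over-weighted by inverse frequency
    floor = min(w for c, w in weights.items() if c not in rare)
    for c in rare:
        if c in weights:
            weights[c] = floor
    return weights
```

These weights would then multiply each class term of the multinomial logistic loss.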
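The FuseNet-style early fusion described above can be illustrated structurally in plain Python, with toy lists standing in for feature maps and plain callables standing in for convolution blocks (all names here are illustrative, not from the paper's code):

```python
def fusenet_encode(rgb, aux, rgb_blocks, aux_blocks):
    """Run two parallel encoders; after each block, sum the auxiliary
    (depth/composite) activations into the RGB stream, which carries the
    fused signal forward."""
    for rgb_block, aux_block in zip(rgb_blocks, aux_blocks):
        rgb = rgb_block(rgb)
        aux = aux_block(aux)
        rgb = [r + a for r, a in zip(rgb, aux)]  # element-wise fusion
    return rgb
```

In the real network each block is a VGG-16 convolution+pooling stage and the fusion is an element-wise tensor sum.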

Experiments

  • Datasets
    • ISPRS 2D Semantic Labeling Challenge. – VHR aerial images over 2 cities – 6 classes
      • ISPRS Vaihingen – resolution: 9 cm/pixel, tiles of ~2100×2100 px. 33 tiles, 16 of which have ground truth. IRRG & DSM; a normalized DSM (nDSM) is also provided
      • ISPRS Potsdam – resolution: 5 cm/pixel, tiles of 6000×6000 px. 38 tiles, 24 of which have ground truth. IRRGB & DSM & nDSM
  • Experimental setup
    • Build a composite image composed of DSM, nDSM, & NDVI
    • Sliding-window approach to extract 128×128 patches – the stride of the sliding window defines the size of the overlapping region between 2 consecutive patches.
    • Training time: smaller stride to extract more training samples – acts as data augmentation (64 px stride for Potsdam, 32 px for Vaihingen).
    • Testing time: smaller stride allows us to average the predictions on the overlapping regions, which reduces border effects and improves overall accuracy.
    • SGD with lr=0.01, momentum=0.9, weight decay = 0.0005, batch size = 10
    • SegNet encoder initialized with VGG-16 weights trained on ImageNet.
    • Decoder weights are randomly initialized.
    • Divide the learning rate by 10 after 5, 10, and 15 epochs.
    • For the ResNet-based model, divide the learning rate by 10 after 20 & 40 epochs. In both cases, the learning rate of the pre-initialized weights is set to half that of the new weights.
    • Results are cross-validated on each dataset using a 3-fold split.
    • Final models for testing are re-trained on the whole training set.
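The sliding-window extraction and overlap averaging above can be sketched as follows (hypothetical helper names; square tiles assumed for brevity):

```python
def sliding_windows(size, patch, stride):
    """Top-left offsets of patch×patch windows over a size×size tile;
    the last offset is clamped so the window stays inside the tile."""
    offsets = list(range(0, size - patch + 1, stride))
    if offsets[-1] != size - patch:
        offsets.append(size - patch)
    return [(x, y) for x in offsets for y in offsets]

def coverage_counts(size, patch, stride):
    """How many overlapping predictions each pixel receives; test-time
    predictions are averaged over these counts to reduce border effects."""
    cov = [[0] * size for _ in range(size)]
    for x, y in sliding_windows(size, patch, stride):
        for i in range(x, x + patch):
            for j in range(y, y + patch):
                cov[i][j] += 1
    return cov
```

A smaller stride raises the per-pixel coverage count, which is why test-time accuracy improves at the cost of more forward passes.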
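The step-decay learning-rate schedule above amounts to the following (a sketch; `lr_at` is an illustrative helper, not from the paper's code):

```python
def lr_at(epoch, base=0.01, milestones=(5, 10, 15), gamma=0.1):
    """Step-decay learning rate: multiply by gamma (i.e. divide by 10)
    once each milestone epoch has been reached."""
    return base * gamma ** sum(epoch >= m for m in milestones)
```

With `milestones=(20, 40)` the same helper reproduces the ResNet schedule; halving `base` gives the rate applied to the pre-initialized weights.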
  • Results

Discussion

  • Baselines and preliminary experiments
    • Baseline: train standard SegNet and ResNet on the IRRG and composite versions of the datasets.
    • Compared to SegNet, ResNet is more stable, but requires more memory. 
  • Effects of multiscale strategy
    • Increasing number of branches improves the overall classification, but by a smaller margin each time. 
  • Effects of fusion strategies. 
    • Late fusion mainly manages to recover cars (which were otherwise under-detected)

Link: https://arxiv.org/abs/1711.08681
