Pedestrian Detection in Thermal Images using Saliency Maps

Abstract

  • Thermal images are good for detecting people at night, but performance is poor in daylight
  • SoA networks use fusion networks with paired thermal and RGB images.
  • Contribution: augment thermal images with saliency maps, which serve as an attention mechanism – eliminating the need for paired colour images
  • Network: Faster R-CNN
  • Saliency maps generated using static and deep methods (PiCA-Net and R3-Net)
  • Dataset: KAIST Multispectral Pedestrian Detection Dataset.

Introduction

  • At night, thermal cameras capture humans distinctly because they are warmer than their surroundings
  • During the day, other objects can be warmer than humans, making pedestrians less distinguishable
  • Colour–thermal pairs are difficult to obtain because image registration is required
  • Saliency – how different a given location is from its surroundings in colour, orientation, motion and depth
  • Baseline – Faster R-CNN detecting pedestrians solely from thermal images in the KAIST dataset
  • Pedestrian detection trained on the augmented images outperformed the baseline
  • Pixel-level annotations for the KAIST dataset (created by the authors) – used to train the deep saliency networks

Related Work

  • Pedestrian detection
    • Traditional – handcrafted features and algorithms – ICF, ACF, LDCF
    • Zhang et al. – Faster R-CNN for pedestrian detection
    • Sermanet et al. – multistage supervised features and skip connections
    • Li et al. – Scale-Aware Fast R-CNN with built-in sub-networks to detect pedestrians at different scales
    • Brazil et al. – SDS-RCNN: joint supervision on pedestrian detection and semantic segmentation to illuminate pedestrians in the frame -> motivation for saliency
    • Liu et al. – fusion method based on Faster R-CNN
    • Li et al. – Illumination-Aware Faster R-CNN, which adaptively integrates colour and thermal sub-networks and fuses the results with a weighting scheme that depends on the illumination condition
    • Region Reconstruction Network – models the relation between RGB and thermal data using a CNN and feeds these features to a multiscale detection network
  • Saliency detection
    • To highlight the most conspicuous object in an image
    • Traditional – global contrast, local contrast, colour and texture
    • Recent works – CNNs for salient object detection
    • DHSNet – first learns global saliency cues such as global contrast, objectness and compactness -> then refines details with a novel hierarchical recurrent CNN
    • Holistically-Nested Edge Detector – short connections added to its skip-layer structure
    • Amulet – aggregates multi-level features at multiple resolutions and learns to predict the saliency map by combining the features at each resolution in a recursive manner

Approach

The baseline for pedestrian detection in thermal images using Faster R-CNN

  • Faster R-CNN trained end-to-end on thermal images

Our Approach: Using saliency maps for improving pedestrian detection

  • Replace one of the duplicated channels of the 3-channel thermal image with the corresponding saliency map
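
The channel replacement can be sketched in a few lines of NumPy. Which of the duplicated channels gets overwritten is my choice here; the paper only specifies that one duplicate is replaced with the saliency map:

```python
import numpy as np

def augment_thermal(thermal_gray, saliency_map):
    """Stack a single-channel thermal image into 3 channels, then
    replace one duplicated channel with its saliency map."""
    assert thermal_gray.shape == saliency_map.shape
    # Duplicate the thermal channel three times, as is usual when feeding
    # grayscale input to an RGB-pretrained backbone ...
    augmented = np.stack([thermal_gray] * 3, axis=-1)
    # ... then overwrite one duplicate (here: the last) with the saliency map.
    augmented[..., 2] = saliency_map
    return augmented

thermal = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
saliency = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
img = augment_thermal(thermal, saliency)
```

The remaining two channels keep the raw thermal signal, so the detector still sees the original intensities alongside the attention-like saliency channel.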

Static Saliency

  • Generated using the OpenCV library, but it highlights not only pedestrians but also other conspicuous objects

Deep Saliency Networks

  • PiCA-Net – pixel-wise contextual attention network – generates an attention map for each pixel indicating the relevance of every other location. Uses a bidirectional LSTM to scan the image horizontally and vertically around a pixel to obtain its global context; for the local context, the attention operation is performed over the local neighbouring region using convolutional layers
  • A U-Net architecture integrates PiCA-Net modules hierarchically for salient object detection
  • R3-Net – uses a residual refinement block (RRB) to learn the residual between the ground truth and the saliency map in a recursive manner
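
The recursive residual idea in R3-Net can be illustrated with a toy loop: each step adds a predicted residual to the current saliency estimate. The "residual block" below is a stand-in closed-form update that moves the map toward the ground truth; in the real network the RRB is a learned CNN operating on image features:

```python
import numpy as np

rng = np.random.default_rng(0)
gt = (rng.random((32, 32)) > 0.8).astype(float)  # toy ground-truth mask
saliency = np.full_like(gt, 0.5)                 # coarse initial estimate

def residual_block(current, target, step=0.5):
    # Stand-in for the learned RRB: predicts a correction that closes
    # half the gap to the ground truth at every recursion.
    return step * (target - current)

# Recursive refinement: s_{t+1} = s_t + RRB(s_t)
for _ in range(5):
    saliency = saliency + residual_block(saliency, gt)

err = np.abs(saliency - gt).mean()  # shrinks geometrically with each step
```

The point of the sketch is the recursion structure: learning small corrections to an existing map is easier than regressing the full map at every stage.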

Our Dataset: Annotating the KAIST Multispectral Pedestrian Dataset for salient pedestrian detection

  • 913 day images and 789 night images (training)
  • these images are manually annotated using the VGG Image Annotator
  • 193 day images and 169 night images (testing)

Experiments

Datasets and Evaluation Protocols

  • From 50k training and 45k testing images, sample every 3rd frame of the training videos and every 20th frame of the testing videos, and exclude pedestrian instances smaller than 50 pixels -> 7.6k training & 2.2k test images
  • The trained deep saliency networks are used to create saliency maps for these train and test images
  • Evaluation of pedestrian detection – log-average miss rate (LAMR) over the FPPI (false positives per image) range [10^-2, 10^0]; also mAP of the detections
  • Evaluation of saliency detection – F-measure (weighted harmonic mean of precision and recall) and mean absolute error (MAE)
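
The evaluation metrics can be sketched as follows. β² = 0.3 for the F-measure and 9 log-spaced FPPI reference points for LAMR are the usual conventions in the literature; the notes themselves only name the metrics:

```python
import numpy as np

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """Weighted harmonic mean of precision and recall on a binarized
    saliency map (beta^2 = 0.3 emphasizes precision, as is customary)."""
    p, g = pred >= thresh, gt >= 0.5
    tp = np.logical_and(p, g).sum()
    precision = tp / max(p.sum(), 1)
    recall = tp / max(g.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth maps in [0, 1]."""
    return np.abs(pred - gt).mean()

def lamr(miss_rates, fppis):
    """Log-average miss rate: geometric mean of the miss rate sampled at
    9 reference points evenly log-spaced over the FPPI range [1e-2, 1e0]."""
    samples = []
    for ref in np.logspace(-2, 0, 9):
        valid = miss_rates[fppis <= ref]   # curve points up to this FPPI
        samples.append(valid[-1] if len(valid) else 1.0)
    return np.exp(np.mean(np.log(np.maximum(samples, 1e-10))))

pred = np.array([[1.0, 1.0], [0.0, 0.0]])
gt = np.array([[1.0, 0.0], [0.0, 0.0]])
fm = f_measure(pred, gt)   # precision 0.5, recall 1.0 -> 0.65 / 1.15
err = mae(pred, gt)        # one wrong pixel out of four -> 0.25
m = lamr(np.full(4, 0.4), np.array([1e-3, 1e-2, 1e-1, 1.0]))  # flat curve -> 0.4
```

Lower is better for LAMR and MAE; higher is better for F-measure and mAP.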

Implementation Details

  • Faster RCNN for pedestrian detection
    • Modifications – removed the 5th max-pooling layer of the VGG16 backbone network
    • The original Faster R-CNN used 3 scales and 3 aspect ratios for the reference anchors; here 9 scales between 0.05 and 4 are used
    • Faster R-CNN initialized with VGG16 weights pre-trained on ImageNet and fine-tuned for 6 epochs
    • The first 2 convolutional layers of VGG16 are frozen and the rest fine-tuned (SGD, momentum = 0.9, lr = 0.001, batch size = 1)
  • Deep saliency network
    • Train PiCA-Net and R3-Net on thermal images with the pixel-level annotations
    • PiCA-Net – augmentation: random mirror flipping and random cropping. Decoder trained from scratch (lr = 0.01) and encoder fine-tuned (lr = 0.001) for 16 epochs, then the learning rates are decayed by 0.1 for another 16 epochs. SGD with momentum 0.9, weight decay 0.0005, batch size 4. Images resized to 224×224 using Lanczos interpolation
    • R3-Net – initialized with weights from a ResNeXt network. SGD, lr = 0.001, momentum = 0.9, weight decay = 0.0005; 9000 iterations, batch size 10
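
One way to realize the "9 scales between 0.05 and 4" for the anchors is geometric spacing. This is an assumption on my part (the notes state only the count and the range), and the base anchor size below is hypothetical:

```python
import numpy as np

# 9 anchor scales spanning 0.05x to 4x. Geometric spacing keeps the
# ratio between consecutive scales constant, a common choice for
# covering pedestrians from very small to very large.
scales = np.geomspace(0.05, 4.0, num=9)

base = 16  # hypothetical base anchor size (pixels) at the feature-map stride
anchor_sizes = base * scales
```

Compared with the 3 scales of the original Faster R-CNN, this much wider range lets the RPN propose boxes for the many small pedestrian instances in KAIST.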

Results and Analysis

  • Performance of Deep Saliency Networks on our KAIST Salient Pedestrian Detection dataset
    • Saliency maps generated by R3-Net are post-processed with a CRF to improve spatial coherence -> better results
  • Quantitative analysis of Pedestrian Detection in Thermal Images using Saliency Maps
    • Using only thermal images: miss rate of 44.2% on day images and 40.4% on night images
    • Using thermal images with static saliency maps: daytime miss rate of 39.4%, i.e. a 4.8% improvement (but no improvement at night)
    • Using thermal images with saliency maps generated from deep networks: PiCA-Net – 32.2% for day images, 21.7% for night images; R3-Net – 30.4% for day and 21% for night
    • R3-Net reaches an mAP of 68.5% during the day (a 6.9% improvement) and 73.2% at night (a 7.7% improvement)

Qualitative analysis and effectiveness of saliency maps for Pedestrian Detection

Conclusion and Future Work

  • In this paper, channel replacement is used to create the augmented thermal images. Suggested extension: add a saliency proposal stage and jointly learn pedestrian detection and saliency detection, as in SDS-RCNN
  • A larger amount of pixel-level annotated data might give better results

Link: https://arxiv.org/abs/1904.06859
