Abstract
- Thermal images work well for detecting objects/people at night, but performance degrades in daylight
- State-of-the-art networks use fusion architectures that require paired thermal and RGB images
- Contribution: augment thermal images with saliency maps that serve as an attention mechanism – eliminates the need for paired colour images
- Network: Faster R-CNN
- Saliency map generation using static and deep methods (PiCA-Net and R3-Net)
- Dataset: KAIST Multispectral Pedestrian Detection Dataset.
Introduction
- At night, thermal cameras capture humans distinctly because they are warmer than surrounding objects
- During the day, other objects can be as warm as or warmer than humans – so pedestrians are less distinguishable
- Colour-thermal pairing is difficult because image registration is required
- Saliency – how different a given location is from its surroundings in colour, orientation, motion and depth
- Baseline – Faster R-CNN detecting pedestrians solely from thermal images in the KAIST dataset
- Pedestrian detection trained on the augmented images outperformed the baseline
- Pixel-level annotations for the KAIST dataset (created by the authors) – used to train the deep saliency networks
Related Work
- Pedestrian detection
- Traditional – handcrafted features and algorithms – ICF, ACF, LDCF
- Zhang et al. – Faster R-CNN for pedestrian detection
- Sermanet et al. – multi-stage supervised features with skip connections
- Li et al. – Scale-Aware Fast R-CNN with built-in sub-networks to detect pedestrians at different scales
- Brazil et al. – SDS-RCNN: joint supervision on pedestrian detection and semantic segmentation to "illuminate" pedestrians in the frame -> motivation for saliency
- Liu et al. – fusion architectures based on Faster R-CNN
- Li et al. – Illumination-Aware Faster R-CNN which adaptively integrates colour and thermal sub-networks and fuses their results using a weighting scheme dependent on the illumination condition
- Region Reconstruction Network – models the relation between RGB and thermal data using a CNN and feeds these features to a multi-scale detection network
- Saliency detection
- To highlight the most conspicuous object in an image
- Traditional – global contrast, local contrast, colour and texture
- Recent works – CNNs for salient object detection
- DHSNet – first learns global saliency cues such as global contrast, objectness and compactness -> then refines details with a hierarchical recurrent CNN
- Hou et al. – add short connections to the skip-layer structure of the Holistically-Nested Edge Detector (HED) for saliency detection
- Amulet – aggregates multi-level features at multiple resolutions and learns to predict the saliency map by combining the features at each resolution recursively
Approach
Baseline: pedestrian detection in thermal images using Faster R-CNN
- Faster R-CNN trained end-to-end on thermal images
Our Approach: Using saliency maps to improve pedestrian detection
- Replace one of the duplicated channels of the 3-channel thermal image with the corresponding saliency map
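A minimal sketch of this channel-replacement augmentation, assuming the thermal frame is stored as three identical channels; the file names and the choice of which channel to overwrite are illustrative assumptions.

```python
import cv2

def augment_with_saliency(thermal_path, saliency_path):
    """Replace one duplicated channel of a 3-channel thermal image with its saliency map."""
    thermal = cv2.imread(thermal_path)                          # HxWx3, channels identical
    saliency = cv2.imread(saliency_path, cv2.IMREAD_GRAYSCALE)  # HxW, 0-255
    saliency = cv2.resize(saliency, (thermal.shape[1], thermal.shape[0]))
    augmented = thermal.copy()
    augmented[:, :, 2] = saliency   # overwrite one duplicate channel (choice is arbitrary here)
    return augmented

# Hypothetical KAIST-style file names
aug = augment_with_saliency("set00_V000_I00000_lwir.png", "set00_V000_I00000_saliency.png")
cv2.imwrite("set00_V000_I00000_augmented.png", aug)
```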
Static Saliency
- Computed using OpenCV's static saliency module, but it highlights not only pedestrians but also other objects
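A sketch of static saliency with OpenCV's saliency module (requires opencv-contrib-python); the spectral-residual method is used here as an example, since the notes only say "OpenCV library" without naming the exact variant.

```python
import cv2

image = cv2.imread("set00_V000_I00000_lwir.png")   # hypothetical thermal frame

# Spectral-residual static saliency from OpenCV's contrib saliency module
sal = cv2.saliency.StaticSaliencySpectralResidual_create()
ok, saliency_map = sal.computeSaliency(image)      # float map, roughly in [0, 1]
saliency_map = (saliency_map * 255).astype("uint8")
cv2.imwrite("static_saliency.png", saliency_map)
```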
Deep Saliency Networks
- PiCA-Net – pixel-wise contextual attention network – generates an attention map for each pixel indicating how relevant every other location is to it. Uses bidirectional LSTMs to scan the image horizontally and vertically to obtain each pixel's global context. For the local context, the attention operation is performed on a local neighbouring region using convolutional layers.
- A U-Net architecture integrates PiCA-Nets hierarchically for salient object detection.
- R3-Net – uses residual refinement blocks (RRBs) to learn the residual between the ground truth and the current saliency map in a recursive manner (see the sketch below).
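A toy PyTorch sketch of the residual-refinement idea: a block predicts a residual from features plus the current saliency map and adds it back, applied recursively. Channel sizes and the single shared block are assumptions for illustration, not R3-Net's actual configuration.

```python
import torch
import torch.nn as nn

class ResidualRefinementBlock(nn.Module):
    """Predict a residual from image features + current saliency map and add it back."""
    def __init__(self, feat_channels=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, features, saliency):
        residual = self.refine(torch.cat([features, saliency], dim=1))
        return saliency + residual                 # refined saliency logits

# Recursive refinement over a few steps
features = torch.randn(1, 64, 56, 56)              # stand-in backbone features
saliency = torch.zeros(1, 1, 56, 56)               # initial (coarse) saliency map
rrb = ResidualRefinementBlock()
for _ in range(3):
    saliency = rrb(features, saliency)
```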
Our Dataset: Annotating the KAIST Multispectral Pedestrian dataset for salient pedestrian detection
- 913 day images and 789 night images (training)
- Manually annotated using the VGG Image Annotator
- 193 day images and 169 night images (testing)
Experiments
Datasets and Evaluation Protocols
- Out of ~50k training and ~45k testing images, every 3rd frame is sampled from the training videos and every 20th frame from the testing videos, and pedestrian instances smaller than 50 pixels are excluded -> 7.6k training & 2.2k test images
- The trained deep saliency networks are used to create saliency maps for these train and test images
- Pedestrian detection evaluation – log-average miss rate (LAMR) over the FPPI (false positives per image) range [10^-2, 10^0]; also mAP of detection
- Saliency detection evaluation – F-measure score (weighted harmonic mean of precision and recall) and Mean Absolute Error (MAE); see the metric sketch below
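A minimal NumPy sketch of the three evaluation metrics mentioned above; the β² = 0.3 weight in the F-measure and the 9 log-spaced FPPI reference points are the usual conventions in the saliency and Caltech/KAIST pedestrian literature, assumed here rather than taken from the notes.

```python
import numpy as np

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """Weighted harmonic mean of precision and recall for a binarized saliency map."""
    pred_bin = pred >= thresh
    gt = gt.astype(bool)
    tp = np.logical_and(pred_bin, gt).sum()
    precision = tp / (pred_bin.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

def mae(pred, gt):
    """Mean absolute error between a [0,1] saliency map and binary ground truth."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def log_average_miss_rate(miss_rates, fppi):
    """LAMR: geometric mean of miss rates sampled at 9 FPPI points in [1e-2, 1e0].
    `fppi` must be sorted ascending, with `miss_rates` aligned to it."""
    refs = np.logspace(-2.0, 0.0, 9)
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        sampled.append(miss_rates[idx[-1]] if len(idx) else 1.0)
    return np.exp(np.mean(np.log(np.maximum(sampled, 1e-10))))
```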
Implementation Details
- Faster RCNN for pedestrian detection
- Modifications – removed the 5th max-pooling layer of the VGG16 backbone network
- The original Faster R-CNN uses 3 scales and 3 aspect ratios for its reference anchors; here 9 anchor scales between 0.05 and 4 are used (see the anchor-scale sketch after this list)
- Faster R-CNN initialized with VGG16 weights pre-trained on ImageNet and fine-tuned for 6 epochs
- Freeze the first 2 convolutional layers of VGG16 and fine-tune the rest (SGD, momentum = 0.9, lr = 0.001, batch size = 1)
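A small sketch of the modified anchor configuration; the geometric (log-uniform) spacing of the 9 scales is an assumption, since the notes only give the count and the 0.05–4 range.

```python
import numpy as np

# 9 RPN anchor scales spanning 0.05 to 4 (spacing assumed geometric)
anchor_scales = np.geomspace(0.05, 4.0, num=9)
print(np.round(anchor_scales, 3))
# ≈ [0.05, 0.086, 0.15, 0.259, 0.447, 0.774, 1.338, 2.314, 4.0]
```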
- Deep saliency network
- Train PiCA-Net and R3-Net on thermal images with pixel-level annotations
- PiCA-Net – augmentation: random mirror flipping and random cropping. Decoder trained from scratch (lr = 0.01) and encoder fine-tuned (lr = 0.001) for 16 epochs, then learning rates decayed by 0.1 for another 16 epochs. SGD with momentum 0.9 and weight decay 0.0005, batch size = 4. Images resized to 224×224 by Lanczos interpolation (see the optimizer sketch after this list)
- R3-Net – initialized with weights from a ResNeXt network. SGD, lr = 0.001, momentum = 0.9, weight decay = 0.0005. 9000 iterations, batch size = 10
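A PyTorch sketch of the PiCA-Net training schedule described above (per-module learning rates with a step decay); the encoder/decoder placeholders are hypothetical stand-ins, not the authors' actual modules.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Placeholder encoder/decoder standing in for PiCA-Net's pretrained backbone
# (fine-tuned) and its decoder (trained from scratch).
model = nn.ModuleDict({
    "encoder": nn.Conv2d(3, 64, 3, padding=1),
    "decoder": nn.Conv2d(64, 1, 3, padding=1),
})

optimizer = SGD(
    [
        {"params": model["decoder"].parameters(), "lr": 0.01},   # from scratch
        {"params": model["encoder"].parameters(), "lr": 0.001},  # fine-tuned
    ],
    momentum=0.9,
    weight_decay=0.0005,
)
scheduler = StepLR(optimizer, step_size=16, gamma=0.1)  # decay both rates by 0.1 after 16 epochs

for epoch in range(32):   # 16 epochs at the base rates + 16 more after the decay
    # ... one pass over the 224x224 thermal training images, batch size 4 ...
    optimizer.step()      # stands in for the actual per-batch updates
    scheduler.step()
```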
Results and Analysis
- Performance of Deep Saliency Networks on our KAIST Salient Pedestrian Detection dataset
- Saliency maps generated by R3-Net are post-processed with a CRF to improve spatial coherence -> better results (see the CRF sketch below)
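A sketch of fully connected CRF refinement using the pydensecrf package, assuming that library is what is meant by "CRF" here; the pairwise kernel parameters are illustrative defaults, not the authors' settings.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, saliency, iters=5):
    """Refine a [0,1] saliency map with a dense CRF.
    `image` is an HxWx3 uint8 array, `saliency` an HxW float array."""
    h, w = saliency.shape
    fg = np.clip(saliency, 1e-6, 1.0 - 1e-6).astype(np.float32)
    probs = np.stack([1.0 - fg, fg], axis=0)               # background / foreground
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)                 # spatial smoothness
    d.addPairwiseBilateral(sxy=60, srgb=5, rgbim=np.ascontiguousarray(image), compat=5)
    q = np.array(d.inference(iters)).reshape(2, h, w)
    return q[1]                                            # refined foreground probability
```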
- Quantitative analysis of Pedestrian Detection in Thermal Images using Saliency Maps
- Using only thermal images: produces a miss rate of 44.2% on day images and 40.4% on night images.
- Thermal images with static saliency maps: daytime miss rate of 39.4%, i.e. a 4.8% improvement (but no improvement at night)
- Thermal images with saliency maps from deep networks: PiCA-Net – 32.2% (day) and 21.7% (night); R3-Net – 30.4% (day) and 21.0% (night)
- R3-Net achieves an mAP of 68.5% during the day (6.9% improvement) and 73.2% at night (7.7% improvement)
Qualitative analysis and effectiveness of saliency maps for Pedestrian Detection
Conclusion and Future Work
- This paper uses channel replacement to create augmented thermal images; future work suggests adding a saliency proposal stage and then jointly learning pedestrian detection and saliency detection, as in SDS-RCNN
- A larger amount of pixel-level annotation data might give better results
Link: https://arxiv.org/abs/1904.06859