Abstract
- Thermal images work well for detecting objects/people at night, but performance degrades in daylight
- State-of-the-art networks use fusion architectures that require paired thermal and RGB images
- Contribution: augment thermal images with saliency maps that serve as an attention mechanism – eliminates the need for paired colour images
- Network: Faster R-CNN
- Saliency map generation using static and deep methods (PiCA-Net and R3-Net)
- Dataset: KAIST Multispectral Pedestrian Detection Dataset.
Introduction
- At night, thermal cameras capture humans distinctly because they are warmer than surrounding objects
- During the day, other objects can be as warm as or warmer than humans – so pedestrians are less distinguishable
- Colour-thermal pairing is difficult because image registration is required
- Saliency – how different a given location is from its surroundings in colour, orientation, motion and depth
- Baseline – Faster R-CNN detecting pedestrians solely from thermal images in the KAIST dataset
- Pedestrian detection trained on the augmented images outperformed the baseline
- Pixel-level annotations for the KAIST dataset (created by the authors) – used to train the deep saliency networks
Related Work
- Pedestrian detection
- Traditional – handcrafted features and algorithms – ICF, ACF, LDCF
- Zhang et al. – Faster R-CNN for pedestrian detection
- Sermanet et al. – multi-stage supervised features with skip connections
- Li et al. – Scale-Aware Fast R-CNN with built-in sub-networks to detect pedestrians at different scales
- Brazil et al. – SDS-RCNN: joint supervision on pedestrian detection and semantic segmentation to "illuminate" pedestrians in the frame -> motivation for saliency
- Liu et al. – fusion architectures based on Faster R-CNN
- Li et al. – Illumination-Aware Faster R-CNN which adaptively integrates colour and thermal sub-networks and fuses their results using a weighting scheme dependent on the illumination condition
- Region Reconstruction Network – models the relation between RGB and thermal data using a CNN and feeds these features to a multi-scale detection network
- Saliency detection
- To highlight the most conspicuous object in an image
- Traditional – global contrast, local contrast, colour and texture
- Recent works – CNNs for salient object detection
- DHSNet – first learns global saliency cues such as global contrast, objectness and compactness -> then refines details with a hierarchical recurrent CNN
- Hou et al. – add short connections to the skip-layer structure of the Holistically-Nested Edge Detector (HED) for saliency detection
- Amulet – aggregates multi-level features at multiple resolutions and learns to predict the saliency map by combining the features at each resolution recursively
Approach
Baseline: pedestrian detection in thermal images using Faster R-CNN
- Faster R-CNN trained end-to-end on thermal images
Our Approach: Using saliency maps to improve pedestrian detection
- Replace one of the duplicated channels of the 3-channel thermal image with the corresponding saliency map
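A minimal sketch of this channel-replacement augmentation, assuming the thermal frame is stored as three identical channels; the file names and the choice of which channel to overwrite are illustrative assumptions.

```python
import cv2

def augment_with_saliency(thermal_path, saliency_path):
    """Replace one duplicated channel of a 3-channel thermal image with its saliency map."""
    thermal = cv2.imread(thermal_path)                          # HxWx3, channels identical
    saliency = cv2.imread(saliency_path, cv2.IMREAD_GRAYSCALE)  # HxW, 0-255
    saliency = cv2.resize(saliency, (thermal.shape[1], thermal.shape[0]))
    augmented = thermal.copy()
    augmented[:, :, 2] = saliency   # overwrite one duplicate channel (choice is arbitrary here)
    return augmented

# Hypothetical KAIST-style file names
aug = augment_with_saliency("set00_V000_I00000_lwir.png", "set00_V000_I00000_saliency.png")
cv2.imwrite("set00_V000_I00000_augmented.png", aug)
```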
Static Saliency
- Computed using OpenCV's static saliency module, but it highlights not only pedestrians but also other objects
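A sketch of static saliency with OpenCV's saliency module (requires opencv-contrib-python); the spectral-residual method is used here as an example, since the notes only say "OpenCV library" without naming the exact variant.

```python
import cv2

image = cv2.imread("set00_V000_I00000_lwir.png")   # hypothetical thermal frame

# Spectral-residual static saliency from OpenCV's contrib saliency module
sal = cv2.saliency.StaticSaliencySpectralResidual_create()
ok, saliency_map = sal.computeSaliency(image)      # float map, roughly in [0, 1]
saliency_map = (saliency_map * 255).astype("uint8")
cv2.imwrite("static_saliency.png", saliency_map)
```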
Deep Saliency Networks
- PiCA-Net – pixel-wise contextual attention network – generates an attention map for each pixel indicating how relevant every other location is to it. Uses bidirectional LSTMs to scan the image horizontally and vertically to obtain each pixel's global context. For the local context, the attention operation is performed on a local neighbouring region using convolutional layers.
- A U-Net architecture integrates PiCA-Nets hierarchically for salient object detection.
- R3-Net – uses residual refinement blocks (RRBs) to learn the residual between the ground truth and the current saliency map in a recursive manner (see the sketch below).
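A toy PyTorch sketch of the residual-refinement idea: a block predicts a residual from features plus the current saliency map and adds it back, applied recursively. Channel sizes and the single shared block are assumptions for illustration, not R3-Net's actual configuration.

```python
import torch
import torch.nn as nn

class ResidualRefinementBlock(nn.Module):
    """Predict a residual from image features + current saliency map and add it back."""
    def __init__(self, feat_channels=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, features, saliency):
        residual = self.refine(torch.cat([features, saliency], dim=1))
        return saliency + residual                 # refined saliency logits

# Recursive refinement over a few steps
features = torch.randn(1, 64, 56, 56)              # stand-in backbone features
saliency = torch.zeros(1, 1, 56, 56)               # initial (coarse) saliency map
rrb = ResidualRefinementBlock()
for _ in range(3):
    saliency = rrb(features, saliency)
```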
Our Dataset: Annotating the KAIST Multispectral Pedestrian dataset for salient pedestrian detection
- 913 day images and 789 night images (training)
- Manually annotated using the VGG Image Annotator
- 193 day images and 169 night images (testing)
Experiments
Datasets and Evaluation Protocols
- Out of ~50k training and ~45k testing images, every 3rd frame is sampled from the training videos and every 20th frame from the testing videos, and pedestrian instances smaller than 50 pixels are excluded -> 7.6k training & 2.2k test images
- The trained deep saliency networks are used to create saliency maps for these train and test images
- Pedestrian detection evaluation – log-average miss rate (LAMR) over the FPPI (false positives per image) range [10^-2, 10^0]; also mAP of detection
- Saliency detection evaluation – F-measure score (weighted harmonic mean of precision and recall) and Mean Absolute Error (MAE); see the metric sketch below
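A minimal NumPy sketch of the three evaluation metrics mentioned above; the β² = 0.3 weight in the F-measure and the 9 log-spaced FPPI reference points are the usual conventions in the saliency and Caltech/KAIST pedestrian literature, assumed here rather than taken from the notes.

```python
import numpy as np

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """Weighted harmonic mean of precision and recall for a binarized saliency map."""
    pred_bin = pred >= thresh
    gt = gt.astype(bool)
    tp = np.logical_and(pred_bin, gt).sum()
    precision = tp / (pred_bin.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

def mae(pred, gt):
    """Mean absolute error between a [0,1] saliency map and binary ground truth."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def log_average_miss_rate(miss_rates, fppi):
    """LAMR: geometric mean of miss rates sampled at 9 FPPI points in [1e-2, 1e0].
    `fppi` must be sorted ascending, with `miss_rates` aligned to it."""
    refs = np.logspace(-2.0, 0.0, 9)
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        sampled.append(miss_rates[idx[-1]] if len(idx) else 1.0)
    return np.exp(np.mean(np.log(np.maximum(sampled, 1e-10))))
```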
Implementation Details
- Faster RCNN for pedestrian detection
- Modifications – removed the 5th max-pooling layer of the VGG16 backbone network
- The original Faster R-CNN uses 3 scales and 3 aspect ratios for its reference anchors; here 9 anchor scales between 0.05 and 4 are used (see the anchor-scale sketch after this list)
- Faster R-CNN initialized with VGG16 weights pre-trained on ImageNet and fine-tuned for 6 epochs
- Freeze the first 2 convolutional layers of VGG16 and fine-tune the rest (SGD, momentum = 0.9, lr = 0.001, batch size = 1)
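A small sketch of the modified anchor configuration; the geometric (log-uniform) spacing of the 9 scales is an assumption, since the notes only give the count and the 0.05–4 range.

```python
import numpy as np

# 9 RPN anchor scales spanning 0.05 to 4 (spacing assumed geometric)
anchor_scales = np.geomspace(0.05, 4.0, num=9)
print(np.round(anchor_scales, 3))
# ≈ [0.05, 0.086, 0.15, 0.259, 0.447, 0.774, 1.338, 2.314, 4.0]
```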
- Deep saliency network
- Train PiCA-Net and R3-Net on thermal images with pixel-level annotations
- PiCA-Net – augmentation: random mirror flipping and random cropping. Decoder trained from scratch (lr = 0.01) and encoder fine-tuned (lr = 0.001) for 16 epochs, then learning rates decayed by 0.1 for another 16 epochs. SGD with momentum 0.9 and weight decay 0.0005, batch size = 4. Images resized to 224×224 by Lanczos interpolation (see the optimizer sketch after this list)
- R3-Net – initialized with weights from a ResNeXt network. SGD, lr = 0.001, momentum = 0.9, weight decay = 0.0005. 9000 iterations, batch size = 10
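A PyTorch sketch of the PiCA-Net training schedule described above (per-module learning rates with a step decay); the encoder/decoder placeholders are hypothetical stand-ins, not the authors' actual modules.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Placeholder encoder/decoder standing in for PiCA-Net's pretrained backbone
# (fine-tuned) and its decoder (trained from scratch).
model = nn.ModuleDict({
    "encoder": nn.Conv2d(3, 64, 3, padding=1),
    "decoder": nn.Conv2d(64, 1, 3, padding=1),
})

optimizer = SGD(
    [
        {"params": model["decoder"].parameters(), "lr": 0.01},   # from scratch
        {"params": model["encoder"].parameters(), "lr": 0.001},  # fine-tuned
    ],
    momentum=0.9,
    weight_decay=0.0005,
)
scheduler = StepLR(optimizer, step_size=16, gamma=0.1)  # decay both rates by 0.1 after 16 epochs

for epoch in range(32):   # 16 epochs at the base rates + 16 more after the decay
    # ... one pass over the 224x224 thermal training images, batch size 4 ...
    optimizer.step()      # stands in for the actual per-batch updates
    scheduler.step()
```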
Results and Analysis
- Performance of Deep Saliency Networks on our KAIST Salient Pedestrian Detection dataset
- Saliency maps generated by R3-Net are post-processed with a CRF to improve spatial coherence -> better results (see the CRF sketch below)
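A sketch of fully connected CRF refinement using the pydensecrf package, assuming that library is what is meant by "CRF" here; the pairwise kernel parameters are illustrative defaults, not the authors' settings.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, saliency, iters=5):
    """Refine a [0,1] saliency map with a dense CRF.
    `image` is an HxWx3 uint8 array, `saliency` an HxW float array."""
    h, w = saliency.shape
    fg = np.clip(saliency, 1e-6, 1.0 - 1e-6).astype(np.float32)
    probs = np.stack([1.0 - fg, fg], axis=0)               # background / foreground
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)                 # spatial smoothness
    d.addPairwiseBilateral(sxy=60, srgb=5, rgbim=np.ascontiguousarray(image), compat=5)
    q = np.array(d.inference(iters)).reshape(2, h, w)
    return q[1]                                            # refined foreground probability
```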
- Quantitative analysis of Pedestrian Detection in Thermal Images using Saliency Maps
- Using only thermal images: produces a miss rate of 44.2% on day images and 40.4% on night images.
- Thermal images with static saliency maps: daytime miss rate of 39.4%, i.e. a 4.8% improvement (but no improvement at night)
- Thermal images with saliency maps from deep networks: PiCA-Net – 32.2% (day) and 21.7% (night); R3-Net – 30.4% (day) and 21.0% (night)
- R3-Net achieves an mAP of 68.5% during the day (6.9% improvement) and 73.2% at night (7.7% improvement)
Qualitative analysis and effectiveness of saliency maps for Pedestrian Detection
Conclusion and Future Work
- This paper uses channel replacement to create augmented thermal images; future work suggests adding a saliency proposal stage and then jointly learning pedestrian detection and saliency detection, as in SDS-RCNN
- A larger amount of pixel-level annotation data might give better results
Link: https://arxiv.org/abs/1904.06859