Kolya Malkin, Caleb Robinson, Le Hou, Rachel Soobitsky, Jacob Czawlytko, Dimitris Samaras, Joel Saltz, Lucas Joppa, Nebojsa Jojic [Microsoft Research, Yale University, Georgia Institute of Technology, Stony Brook University, Chesapeake Conservancy] – ICLR 2019
Abstract
- DL method – converts low-resolution labels to high-resolution labels, given the joint distribution between low- and high-resolution labels
- Novel loss function – minimizes the distance between the distributions computed from a set of model outputs and the corresponding distributions implied by the low-resolution labels over the same set of outputs
- Class matching is not required; the method also applies to high-resolution semantic segmentation where HR-labelled data is not available
- Outperforms models trained only on HR labels
Introduction
- Semantic segmentation – labeling each pixel of an input image X = {x_ij} with one of L classes Y = {y_ij}, y ∈ {1, . . . , L} (application classes)
- Weakly supervised segmentation – only partial observation of the target ground-truth labels, e.g. a summary of class labels instead of pixel-level labels
- Low-resolution classes – Z = {z_k}, z ∈ {1, . . . , N} (accessory classes) – each defined for a set of pixels in the input image
- Joint distribution P(Y, Z)
- A training image X is divided into K sets B_k, each with an accessory class label z_k, and the model is trained to produce HR labels y_ij
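As a minimal sketch of this setup (all sizes, channel counts, and class counts below are hypothetical, chosen only for illustration), the blocks B_k and their accessory labels z_k can be represented like this:

```python
import numpy as np

# Illustrative setup (sizes and class counts are made up): a 64x64, 4-channel
# training image X divided into K = 16 non-overlapping 16x16 blocks B_k,
# each block carrying one low-resolution accessory label z_k in {1, ..., N}.
H, W, C, block, N = 64, 64, 4, 16, 4
rng = np.random.default_rng(0)

X = rng.random((H, W, C))                                  # input image
Z = rng.integers(1, N + 1, size=(H // block, W // block))  # z_k per block

# Enumerate each pixel set B_k as a (row slice, column slice) pair.
blocks = [(slice(a * block, (a + 1) * block), slice(b * block, (b + 1) * block))
          for a in range(H // block) for b in range(W // block)]
```

Indexing X with one entry of `blocks` then yields the pixels of that B_k.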
- Contribution: a general solution to weakly supervised image segmentation, demonstrated on land cover mapping and lymphocyte segmentation from pathology imagery
- Disadvantages of state-of-the-art methods:
- Dai et al., Papandreou et al. – require bounding boxes around object instances
- Krähenbühl & Koltun, Hong et al. – match class density functions to weak labels
- Lempitsky & Zisserman – localization and enumeration of small foreground objects with known sizes
- Chen et al. – expensive inference steps (CRF or iterative evaluation), impractical on large datasets
- Proposed method: the segmentation network outputs probabilistic estimates of the (HR) application labels and summarizes them over the sets B_k, yielding an estimated distribution of HR application labels for each set; these are compared with the LR labels using standard distribution-distance metrics
- 1st contribution: a label SR network that infers the distribution of HR labels suggested by the given LR labels, based on visual cues in the input images
- 2nd contribution: the method can exploit additional training data that has only weak labels
Converting a Semantic Segmentation Network into a Label Super-Resolution Network
- φ – learned network parameters. The semantic segmentation distribution factorizes as p(Y|X; φ) = ∏_{i,j} p(y_ij|X; φ), where each p(y_ij|X; φ) is a distribution over the possible labels y ∈ {1, . . . , L}
- The network is trained on pairs of observed training images and label images (X^t, Y^t) to maximize: φ* = argmax_φ Σ_t log p(Y^t|X^t; φ) = argmax_φ Σ_t Σ_{i,j} log p(y^t_ij|X^t; φ)
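This fully supervised objective is ordinary pixelwise negative log-likelihood (cross-entropy). A minimal numpy sketch with random data and illustrative shapes:

```python
import numpy as np

# Pixelwise negative log-likelihood: -mean over pixels of log p(y_ij | X; phi).
# probs: network softmax output, shape (H, W, L); labels: HR ground truth, (H, W).
def pixelwise_nll(probs, labels):
    # Pick out the probability assigned to the true class at each pixel.
    picked = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    return -np.mean(np.log(picked))

rng = np.random.default_rng(0)
logits = rng.random((8, 8, 3))                              # fake network outputs
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
labels = rng.integers(0, 3, size=(8, 8))                    # fake HR labels
loss = pixelwise_nll(probs, labels)
```

Minimizing this loss over φ is equivalent to the argmax objective above.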
- Assumption:
- No pixel-level supervision (Y^t); instead, LR labels z_k ∈ {1, . . . , N} are given on sets (blocks) B_k of input pixels
- A statistical joint distribution over the counts c_l of pixels of each HR label l ∈ {1, . . . , L}, conditioned on the LR label z: p_coarse(c_1, c_2, . . . , c_L|z)
- Semantic segmentation network
- Using coarse labels as statistical descriptors:
- Coarse labels can provide weak supervision by dividing blocks of pixels into categories that are statistically different from each other; this requires representing the distribution of HR pixel counts in these blocks, p_coarse(c|z)
- Label counting:
- p_coarse(c|z) – the connection between coarse and fine labels
- p_net(c_{l,k} = c|X) – modeled as a Gaussian distribution
- p(Y|X) – model that outputs distributions over HR labels given input X
- The label count approximately follows a Gaussian distribution (it averages many random variables)
- Must summarize the model output over LR block Bk
- Label counting layer computes a statistical representation
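A hedged sketch of such a counting layer, under a pixel-independence assumption: the mean label frequency of a block is the average of the per-pixel class probabilities, and its variance sums per-pixel Bernoulli variances (block size and class count below are illustrative, not from the paper):

```python
import numpy as np

# "Label counting layer" sketch: from per-pixel class probabilities over one
# block B_k, compute mean and variance of the label frequency eta_{l,k},
# which the method then treats as Gaussian.
def count_statistics(probs_block):
    # probs_block: (n_pixels, L) softmax outputs for the pixels in one block
    n = probs_block.shape[0]
    mu = probs_block.mean(axis=0)                           # E[eta_l]
    # Var of a sum of independent Bernoullis, rescaled to a frequency:
    var = (probs_block * (1.0 - probs_block)).sum(axis=0) / n**2
    return mu, var

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4), size=256)   # one 16x16 block, L = 4 classes
mu, var = count_statistics(p)
```

These (mu, var) per class are the network-side statistics fed into the matching loss.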
- Statistics matching loss:
- Computes the mismatch between the two distributions, D(p_net, p_coarse), which is then used as the optimization criterion for segmentation
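The exact distance D used in the paper is not reproduced here; as one hedged instantiation, since both sides are modeled as Gaussians, the closed-form KL divergence between two univariate Gaussians can play the role of D for each class:

```python
import numpy as np

# One possible statistics-matching distance (an illustrative choice, not
# necessarily the paper's exact D): KL divergence between the network's
# Gaussian over label frequencies N(mu_p, var_p) and the Gaussian implied
# by the coarse-label statistics N(mu_q, var_q) for LR class z.
def gaussian_kl(mu_p, var_p, mu_q, var_q):
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

# Example: network predicts frequency 0.6 +/- sqrt(0.01); coarse statistics
# say 0.7 +/- sqrt(0.02). The mismatch is positive and differentiable in mu_p.
loss = gaussian_kl(0.6, 0.01, 0.7, 0.02)
```

Summing such terms over classes and blocks gives a loss that is zero only when the predicted statistics match the coarse ones exactly.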
Applications and Experiments
- Land Cover Super-Resolution:
- Land cover classified data difficult and expensive to acquire at high resolution.
- This work implemented automated land-cover change detection using its model
- Dataset and training: 3 goals.
- (1) show how models trained only on low-resolution data with label super-resolution compare to models trained with high-resolution data
- (2) show how models trained with label super-resolution perform on heterogeneous land-cover data (urban areas)
- (3) measure the effect of utilizing both low- and high-resolution labels
- 3 datasets. –
- 4-channel HR (1 m) aerial images from the US Department of Agriculture
- HR (1 m) land-cover data covering the Chesapeake Bay watershed
- LR (30 m) land-cover data from NLCD
- The data is divided into 4 geographic regions: 1 training region with HR labels and 3 test regions
- train and test 4 groups of models.
- HR model – has access only to HR labels
- SR model – trained with the proposed label super-resolution loss; has access only to LR labels from the region in which it is tested
- Baseline weakly supervised model – has access only to LR labels
- HR + SR model – has access to both HR and LR labels
- Baseline models:
- HR base model – U-Net core trained to minimize pixelwise cross-entropy loss on HR labels (U-Net outperformed SegNet, ResNet, and full-resolution ResNet)
- Soft naive – uses the NLCD mean class frequency as the target label distribution for every pixel
- Hard naive – uses a one-hot vector of the most frequent label
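The two naive targets can be sketched in a few lines (the class frequencies below are invented purely for illustration):

```python
import numpy as np

# Suppose (hypothetical numbers) the mean HR-class frequencies for one NLCD
# class z are: water 0.7, tree 0.2, field 0.1, built 0.0.
freq = np.array([0.7, 0.2, 0.1, 0.0])

# "Soft naive": every pixel in a block with LR label z is trained against
# this full frequency distribution.
soft_target = freq

# "Hard naive": every pixel is trained against a one-hot vector of the
# single most frequent HR class.
hard_target = np.eye(len(freq))[freq.argmax()]
```

Both baselines ignore visual cues within the block, which is what the statistics-matching loss is designed to exploit.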
- EM approach –
- M-step: train the SR model only
- E-step: perform inference of HR labels on the training set, then apply superpixel denoising; assign labels in each block according to this smoothed prediction
- Repeat the EM iterations
- This paper uses superpixel denoising instead of dense-CRF for computational efficiency
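A high-level, runnable skeleton of this EM loop (the model, inference, and superpixel steps are stubs; all names and shapes are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_model(images, labels):            # M-step stub: fit segmentation net
    return {"trained_on": len(images)}

def predict_hr_labels(model, image):        # E-step stub: per-pixel argmax labels
    return rng.integers(0, 4, size=image.shape[:2])

def superpixel_denoise(labels):             # stub: cheaper alternative to dense-CRF
    return labels                           # real version gives each superpixel its majority label

images = [np.zeros((32, 32, 4))]
# Initial pseudo-labels (e.g. from a naive block assignment; random here).
pseudo_labels = [rng.integers(0, 4, size=(32, 32)) for _ in images]

for _ in range(3):                          # repeat EM iterations
    model = train_model(images, pseudo_labels)                # M-step
    preds = [predict_hr_labels(model, x) for x in images]     # E-step: inference...
    pseudo_labels = [superpixel_denoise(p) for p in preds]    # ...then denoising
```

The loop alternates between fitting the network and refreshing smoothed pseudo-labels, which is why its inference cost per iteration matters at dataset scale.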
Conclusions
- An SR network capable of deriving HR labels from low-resolution labels, under the assumption that the joint distribution between LR and HR classes is known
Link: https://openreview.net/pdf?id=rkxwShA9Ym