
# Learning from label proportions for SAR image classification

Yongke Ding^{1}, Yuanxiang Li^{2} and Wenxian Yu^{1}

*EURASIP Journal on Advances in Signal Processing* **2017**:41

https://doi.org/10.1186/s13634-017-0478-8

© The Author(s) 2017

**Received:** 9 March 2016 · **Accepted:** 15 May 2017 · **Published:** 31 May 2017

## Abstract

Synthetic aperture radar (SAR) image classification plays a key role in SAR interpretation. Due to the cost and difficulty of truth labeling for SAR images, the number of newly labeled samples available for image classification is very limited. This paper focuses on a new sample labeling method, called grid labeling, to solve the problem of truth acquisition for training data in SAR image classification, and presents an efficient classification framework for high-resolution SAR images built on learning from uncertain labels. Grid labeling enables rapid training data acquisition by assigning a label to a group of neighboring pixels at a time. A novel SVM-based learning model is proposed to optimize the uncertain training data within the constraints of label proportions in each group and then to predict the label of each sample in the test data based on the optimized training set. The model demonstrates good performance in both accuracy and efficiency for scene interpretation of high-resolution SAR images.

## Keywords

- SAR
- Image classification
- SVM
- Label proportion
- Land cover

## 1 Introduction

Synthetic aperture radar (SAR) image classification or land use/land cover (LULC) mapping plays an important role in many and diverse SAR applications [1]. Supervised learning is one of the most popular methods for SAR image classification. Supervised classification methods usually take a number of labeled samples to train the classifiers, and the classification accuracy depends on the selected training samples [2]. The acquisition of the ground truth for training data is expensive and time consuming. This is especially true for massive, high-resolution (HR) SAR data, such as TerraSAR-X, COSMO-SkyMed, etc. Due to the SAR imaging mechanism, the complex backscattering effects of objects and scenes in such HR SAR images lead to local noise, double reflections, and even object deformation and missing parts, making sample labeling highly difficult. As a result, researchers need to label the training samples more precisely and carefully, and may even need the support of professional interpreters, field surveys, or other relevant information, such as ancillary multispectral, geographic information system (GIS), or elevation data [3, 4]. The cost and difficulty of sample labeling have severely restricted the applications of the large amounts of available HR SAR data, even though the data contain rich details of objects and scenes.

Related works on efficient training set definition for remote sensing image classification are mainly based on semisupervised learning with limited or small training sets, including transfer learning (TL) and active learning (AL). TL-based classification methods are designed to achieve high classification accuracy with a relatively small number of labeled samples from the new image (target domain) by efficient reuse of the training data from previous different but related images (source domain) [5–7]. AL-based classification methods focus on reducing the number of samples to be labeled by a human expert through iteratively choosing the most informative (i.e., uncertain and diverse) samples from the target domain [6–8]. A common practice is to combine TL and AL in the same learning framework [6, 7]. The abovementioned approaches mainly focus on effective and efficient sample selection from the same and/or different sources of images to reduce the labeling cost. The quality of the previously labeled samples affects the classification result, and labeling new samples remains difficult and inefficient, as it is performed at the pixel level and in an iterative way. This paper aims to solve the problem of sample truth acquisition for SAR classification by exploring a new sample labeling method and the corresponding learning method.

The issue of learning from label proportions has been attracting attention in the machine learning area [9–12]. In this setting, the training samples are divided into groups, and the label proportions of the samples in each group are given as sample truth, instead of the label of each individual sample in the training set [11]. Learning from label proportions makes it possible to predict the label of each sample while labeling the training samples group by group, which shows potential for efficient classification of HR SAR images.

However, current classification methods based on label proportions [10, 11] usually come from the machine learning area and are not well suited to SAR image classification because (1) they rely on detailed label proportions, that is, they require the proportion of every label in a group, which is still inefficient for sample labeling; and (2) they cannot perform multiclass classification directly. Although they can be extended to multiclass classification [11, 13], this requires extra computation.

The aim of this paper is to provide a unified framework with efficient sample labeling and model learning for SAR images. Inspired by the idea of learning from label proportions, we firstly introduce grid labeling to get the truth of training data more efficiently by giving the proportions of the labels in each group. Then, we present a novel support vector machine (SVM)-based learning model to eliminate the negative effects of label uncertainty from grid labeling with the support of label proportions. Moreover, an efficient inference method is proposed to approximately optimize the learning model.

The main contributions of our work are (1) introducing grid labeling for SAR image classification, to acquire the truth of the training data rapidly and at low cost; (2) building an SVM-based learning model by taking into consideration the constraints of label proportions; and (3) proposing an efficient inference method to reweight samples according to sample reliability and label proportions. The incorporation of label proportions and the reweighting strategy helps to enhance the robustness of the SVM classifier in the presence of uncertain (partially mislabeled) samples from grid labeling. To evaluate our method, we consider HR TerraSAR-X images of Tianjin and Rosenheim with four LULC classes [1]: urban area (UA), woodland (WL), open area (OA, such as farmland, grassland, bare soil, etc.), and water body (WB). We have also developed a software tool to implement grid labeling. In our experiments, our method outperforms the state-of-the-art approach, i.e., ∝SVM in [11]. Besides, it demonstrates closely comparable accuracy with traditional supervised classification methods working with accurate pixel labeling, but greatly reduces the labeling cost of training data, making large-scale SAR image classification more realizable.

This paper is organized as follows. Section 2 presents the SAR image classification framework based on grid labeling. Section 3 introduces the grid labeling concept for efficient training set definition. Section 4 describes the proposed learning model with label proportions, including an efficient approximate inference algorithm based on an alternating optimization strategy. Section 5 reports experimental results on the TerraSAR-X datasets and a simulated dataset. Our work is concluded in Section 6.

## 2 The classification framework

At the stage of training set definition, grid labeling slices the selected training regions into grid cells to form the training set *T* and assigns a single label to each cell *C* in *T*, which is shared by all the pixels in the cell. So grid labeling outputs the label proportions of each cell in the training data. Then, at the stage of training, the SVM classifier [15, 16] with a boosting strategy [17] is adopted to optimize the training data by reweighting the training samples within the constraints of label proportions. Finally, the SVM classifier predicts the label for each sample in the test data based on the optimized model parameters.

## 3 Grid labeling for rapid training set definition

### 3.1 Grid labeling concept

Grid labeling (GL) is introduced in this paper to acquire training data with high efficiency and at low cost. Instead of assigning one label to each pixel in the training set as in pixel labeling (PL), we assign a single label to each cell by grid labeling. All pixels in the cell will share the same label. Assume a proportion of the cells obtained by slicing the original SAR images are selected randomly and the training set is made up of pixels in these cells. The label for a cell *C* is determined by the major class with the maximum number of pixels in *C*. We call it “naive grid labeling” as it does not provide any other information than the label of the major class in a cell.
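As an illustration, naive grid labeling can be sketched in a few lines of Python. This is our own sketch operating on a hypothetical ground-truth label map (useful for simulation), not the paper's labeling tool:

```python
import numpy as np

def naive_grid_labeling(truth, cell_size):
    """Slice a ground-truth label map into square cells and assign each
    cell the label of its major class (the class covering most pixels)."""
    h, w = truth.shape
    cell_labels = {}
    for r in range(0, h, cell_size):
        for c in range(0, w, cell_size):
            cell = truth[r:r + cell_size, c:c + cell_size]
            labels, counts = np.unique(cell, return_counts=True)
            cell_labels[(r, c)] = labels[np.argmax(counts)]  # major class
    return cell_labels
```

In practice the human annotator supplies the per-cell label directly; the helper above only mimics that decision from a reference map.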

Although pixel labeling is usually accomplished by drawing a polygon on the image to annotate a region of pixels of the same class and filling it with a certain color, several factors still make pixel labeling a hard and laborious task, such as the various scales and shapes of the polygons, the unclear criterion for how small a region can be before it is ignored, and regions with uncertain labels. Compared to traditional pixel labeling, the main advantages of grid labeling are its efficiency and operability, which include the following: (1) Low cost and high efficiency. Grid labeling means greatly reduced cost and high efficiency, making large-scale SAR image classification more realizable. (2) Significant reduction of the difficulty in labeling HR SAR images. The backscattering effects of objects and scenes in HR SAR images are complex, making it difficult to precisely assign a label to each pixel. For grid labeling, the person in charge of labeling needs no special training, and expensive field surveys are avoided. (3) Grid labeling also emphasizes more context clues than pixel labeling, which makes it more suitable for scene understanding.

### 3.2 Proportional grid labeling

As mentioned above, naive grid labeling operates on the groups (grid cells) obtained by directly slicing the images in the spatial domain, which is an efficient and convenient way of sample labeling. But naive grid labeling provides too little information (only the label of the major class in a cell) about sample classes, which inevitably introduces label uncertainty and eventually reduces the classification accuracy. Following the idea of learning with label proportions [9–11], we can learn a model to predict the labels of individual samples by grouping the training samples and providing the proportions of the labels in each group. However, the current definition of label proportion [11] ignores the spatial sources of the samples in each group, which means samples in a group are not necessarily from the same local region, making it inconvenient for sample labeling of remote sensing images. Thus, we introduce proportional grid labeling (proportional GL) by combining grid labeling with label proportions to implement rapid training set definition for SAR images. By estimating the proportions of each label in cell *C*, proportional GL not only implements convenient and efficient sample labeling for SAR images, as naive grid labeling does, but also provides more information about sample classes.

Let \(\mathcal {L}=\{{{l}_{1}},\ldots,{{l}_{M}}\}\) be the label set, where *M* is the number of labels in \(\mathcal {L}\). Assume the training set \(T=\{{x}_{i}\}_{i=1}^{N}\) contains *N* training samples *x*_{i}. Then, *T* can be divided into *K* disjoint cells *C*_{k} (*k*=1,…,*K*), where \(\bigcup \nolimits _{k=1}^{K}{{C}_{k}}=\{1,\ldots,N\}\) and {*x*_{i}|*i*∈*C*_{k}} is the set of samples in *C*_{k}. Label proportion is defined as the proportion of samples with a certain label in a grid cell. The label proportion for \({{l}_{i}}\in \mathcal {L}\) in cell *C*_{k} is

$${{p}_{k}}({{l}_{i}})=\frac{1}{|{{C}_{k}}|}\sum\limits_{j\in {{C}_{k}}}{\left[{y_{j}^{*}}={{l}_{i}}\right]}$$

where \({y_{j}^{*}}\) is the real pixel-level ground truth of *x*_{j}, and |*C*_{k}| denotes the total number of samples in cell *C*_{k}. Then, proportional GL can be presented in three forms by giving label proportions for each class or for the major class in a cell during the sample labeling process.

**1) Multiclass GL:** The task of multiclass GL is to estimate the proportion *p*_{k}(*l*_{i}) for each \({{l}_{i}}\in \mathcal {L}\) (*i*=1,…,*M*) in a grid cell *C*_{k} by a human expert.

**2) Majorclass GL:** Based on the definition of multiclass GL, assume \({{l}^{(k)}}=\arg \underset {{{l}_{i}}\in \mathcal {L}}{\mathop {\max }}\,{{p}_{k}}({{l}_{i}})\) is the label which has the largest (or major) proportion in *C*_{k}, and *p*_{k}(*l*^{(k)}) is its proportion. Then, *l*^{(k)} is called the major class of cell *C*_{k}. Using *p*_{k} to denote *p*_{k}(*l*^{(k)}) for short, we get

$${{p}_{k}}\left({{\mathcal{L}}^{\backslash {{l}^{(k)}}}}\right)=\sum\limits_{{{l}_{j}}\in {{\mathcal{L}}^{\backslash {{l}^{(k)}}}}}{{{p}_{k}}({{l}_{j}})}=1-{{p}_{k}}$$

where \({{\mathcal {L}}^{\backslash {{l}^{(k)}}}}=\left \{{{l}_{j}}\in \mathcal {L}|{{l}_{j}}\ne {{l}^{(k)}}\right \}\) is the set of all labels except *l*^{(k)}. So the task of majorclass GL is just to give *p*_{k} for the major class *l*^{(k)} in a grid cell *C*_{k} by a human expert.

**3) Naive GL:** Based on the definition of majorclass GL, we can predefine *p*_{k}=1 for simplicity, which also means \({{p}_{k}}\left ({{\mathcal {L}}^{\backslash {{l}^{(k)}}}}\right)=0\). Then, the human expert only needs to indicate the majorclass label *l*^{(k)} in a grid cell *C*_{k}, without estimating the proportion of any label. In this way, majorclass GL degrades to the naive grid labeling (naive GL) described in Section 3.1.
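The three labeling forms amount to computing, or partially reporting, the per-cell label proportions. A minimal Python sketch with hypothetical helpers, assuming a ground-truth cell is available for simulation purposes:

```python
import numpy as np

def label_proportions(truth_cell):
    """Multiclass GL truth for one cell: label proportions p_k(l_i)."""
    labels, counts = np.unique(truth_cell, return_counts=True)
    return dict(zip(labels.tolist(), (counts / truth_cell.size).tolist()))

def majorclass_gl(truth_cell):
    """Majorclass GL truth for one cell: (major label l^(k), proportion p_k)."""
    props = label_proportions(truth_cell)
    major = max(props, key=props.get)
    return major, props[major]
```

Naive GL corresponds to reporting only the first element of `majorclass_gl` and treating *p*_{k} as 1.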

## 4 Learning from grid labeling

In this section, we focus on a SAR image classification method that learns from majorclass GL. The main reason we choose majorclass GL for sample labeling is its good balance between labeling efficiency and integration with label priors. Majorclass GL provides the label proportion of the major class in a cell, which is the most important one among all labels. Although multiclass GL provides more information than majorclass GL, it is much more difficult to estimate the proportion of every class than of the major class alone during sample labeling. Besides, under multiclass proportions, the optimization algorithm has to satisfy the proportion of every label in each cell, whereas under majorclass proportions it only needs to satisfy the proportion of the majorclass label; the learning problem under multiclass constraints is therefore considerably harder to solve.

### 4.1 The learning model with label proportions

We build a learning model with label proportions based on the SVM classifier [16] with a boosting strategy [17]. The SVM classifier takes the simple but efficient features (i.e., backscattering intensity, texture, and supertexture) proposed in [1] as its inputs and outputs the land cover types of the SAR images. Optimization of the training set from grid labeling is implemented by the reweighting strategy of the boosting algorithm [17, 18], which we improve by introducing a new penalty term for label proportions.

The learning problem is formulated as

$$\underset{\omega,\mathbf{w},b}{\mathop{\min}}\,\frac{1}{2}{{\left\|\mathbf{w}\right\|}^{2}}+\lambda\sum\limits_{i=1}^{N}{{{\omega}_{i}}L\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi({{x}_{i}})+b\right)}+{{\lambda}_{p}}\sum\limits_{k=1}^{K}{{{L}_{p}}\left({{\tilde{p}}_{k}}(\omega),{{p}_{k}}\right)}\qquad(4)$$

where *ω*_{i} is the weight for training sample *x*_{i}; *ω*=[*ω*_{1},…,*ω*_{N}]^{T} is the weight vector; *y*_{i} is the true label of *x*_{i} from grid labeling; *L*(·)≥0 is the loss function of the traditional SVM model; the second loss function *L*_{p}(·)≥0 is designed to penalize the dissimilarity between the true label proportion *p*_{k} and the predicted label proportion \({{\tilde {p}}_{k}}\) based on *ω*. We define \({{\tilde {p}}_{k}}(\omega)=\frac {1}{|{{C}_{k}}|}\sum \limits _{i\in {{C}_{k}}}{[\!{{\omega }_{i}}>0]}\) because a nonzero value is assigned to *ω*_{i} only if the majorclass label *l*^{(k)} is reliable enough to be taken as the true label of *x*_{i} (*i*∈*C*_{k}), i.e., *y*_{i}=*l*^{(k)}; *φ*(*x*_{i}) maps *x*_{i} into a higher-dimensional space; *λ* and *λ*_{p} are the regularization parameters. Our task is to simultaneously optimize the weights *ω* and the SVM model parameters **w** and *b* so that \({{\tilde {p}}_{k}}\) estimated using *ω* is close enough to *p*_{k}.

Based on pixel-level training and label proportion constraints from grid labeling, this model can not only implement pixel-level classification for SAR images but also improve the efficiency of sample labeling and reduce the corresponding cost. We build this model on SVM because the training set can be optimized by reweighting the importance of each sample used for training in the SVM model; this strategy has also been adopted in transfer learning methods [7, 17]. And we can easily extend the number of classes for classification by defining a training set with rich classes for the SVM classifier.

### 4.2 Model inference using LpcSVM algorithm

The unknown sample weights *ω* can be seen as a link between the supervised learning loss *L*(·) and the label proportion loss *L*_{p}(·) in Eq. (4). So Eq. (4) can be solved by an alternating optimization strategy [11, 19], i.e., optimizing (**w**, *b*) and *ω* one at a time while fixing the other. For fixed sample weights *ω*=[*ω*_{1},…,*ω*_{N}]^{T}, the optimization of Eq. (4) w.r.t. **w** and *b* becomes a classic SVM problem. When **w** and *b* are fixed, *ω* can be updated according to label agreement and label proportion constraints. The baseline strategy of weight distribution update is to assign lower weights to misclassified or unreliable samples in the training set [20]. Assuming the majority of samples in the training set are correctly labeled by grid labeling, correctly classified (reliable) training samples are assigned strong weights and misclassified samples are assigned weak weights in the reweighting process. An updated *ω* means an updated classifier, which is expected to be more trustworthy than previous ones, since it is trained with more reliable samples.
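The alternating scheme can be sketched as follows. This is an illustrative Python toy, not the authors' implementation: a weighted nearest-centroid classifier stands in for the weighted SVM, centroid distances stand in for the potentials *E*(*l*|*x*), and the weight update follows the 1/*θ*/0 pattern described in Section 4.2:

```python
import numpy as np

def fit_centroids(X, y, w, n_classes):
    """Stand-in for weighted SVM training: weighted class centroids."""
    return np.array([
        np.average(X[y == c], axis=0, weights=w[y == c]) if w[y == c].sum() > 0
        else X[y == c].mean(axis=0)
        for c in range(n_classes)
    ])

def alternating_optimization(X, y, cells, p, n_classes, n_iter=4, theta=0.5):
    """Alternate between fitting the model (w fixed) and reweighting (model
    fixed). y holds the majorclass labels from grid labeling; cells is a list
    of per-cell index arrays; p holds the majorclass proportions p_k."""
    w = np.ones(len(X))
    for _ in range(n_iter):
        centroids = fit_centroids(X, y, w, n_classes)         # fix w, fit model
        dist = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        for k, idx in enumerate(cells):                        # fix model, update w
            lk = y[idx[0]]                                     # majorclass label of cell k
            other = np.delete(dist[idx], lk, axis=1).min(axis=1)
            R = dist[idx, lk] - other                          # reliability: smaller = better
            rank = np.argsort(np.argsort(R))                   # 0-based position delta_i - 1
            Ns = int(p[k] * len(idx))
            Nm = len(idx) // n_classes
            w[idx] = np.where(rank < Nm, 1.0,
                              np.where(rank < Ns, theta, 0.0))
    return w
```

With well-separated classes, the mislabeled samples in each cell receive zero weight after a few iterations, which is the intended effect of the reweighting.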

Initially, we set *ω*_{i}=1 (*i*=1,…,*N*) and obtain **w** and *b* through traditional SVM training, with *y*_{i}=*l*^{(k)} for *x*_{i} (*i*∈*C*_{k}), as the major class is used for labeling all the samples in *C*_{k} during the training process. When **w** and *b* are obtained, the problem of Eq. (4) becomes

$$\underset{\omega}{\mathop{\min}}\,\lambda\sum\limits_{i=1}^{N}{{{\omega}_{i}}L\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi({{x}_{i}})+b\right)}+{{\lambda}_{p}}\sum\limits_{k=1}^{K}{{{L}_{p}}\left({{\tilde{p}}_{k}}(\omega),{{p}_{k}}\right)}\qquad(6)$$

Then, we can optimize *ω*_{i} (*i*∈*C*_{k}) on each cell separately, as the influence of each cell *C*_{k} on the objective function in Eq. (6) is independent. This means Eq. (6) can be converted into a cell-wise optimization form,

$$\underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min}}\,\lambda\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi({{x}_{i}})+b\right)}+{{\lambda}_{p}}{{L}_{p}}\left({{\tilde{p}}_{k}}(\omega),{{p}_{k}}\right)\qquad(7)$$

where the absolute loss \({{L}_{p}}\left ({{\tilde {p}}_{k}}(\omega),{{p}_{k}}\right)=|{{\tilde {p}}_{k}}(\omega)-{{p}_{k}}|\) is used in this paper. We adopt an approximate cell-wise optimization method for estimating the weights *ω*_{i} in Eq. (7).

First, we compute *R*(*l*^{(k)}|*x*_{i}), a measure of sample reliability or label agreement, that is, how likely *x*_{i} will be classified as *y*_{i}=*l*^{(k)} by the classifier. It is defined as

$$R\left({{l}^{(k)}}|{{x}_{i}}\right)=E\left({{l}^{(k)}}|{{x}_{i}}\right)-\underset{l\in {{\mathcal{L}}^{\backslash {{l}^{(k)}}}}}{\mathop{\min}}\,E(l|{{x}_{i}})\qquad(8)$$

where *E*(*l*|*x*_{i})=−log(*P*(*l*|*x*_{i})). The class posterior distribution *P*(*l*|*x*_{i}) is estimated by LIBSVM [16], which uses the method in [21]. The basic idea of that method is to estimate pairwise (i.e., one-against-one) class probabilities from decision values (or cost values) and then solve an optimization problem to obtain the multiclass probabilities. Compared to the simple potential form *E*(*l*^{(k)}|*x*_{i}), the definition in Eq. (8) is expected to better depict the potential of *x*_{i} to be classified as *l*^{(k)}, by measuring the distance between the potential of label *l*^{(k)} and the minimal value among all the other labels (i.e., the label most likely to replace *l*^{(k)}). A smaller value of *R*(*l*^{(k)}|*x*_{i}) means *l*^{(k)} is more reliable as the true label of *x*_{i}.
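Given a matrix of class posteriors, the reliability measure of Eq. (8) is straightforward to compute. A sketch with a hypothetical helper:

```python
import numpy as np

def reliability(P, lk):
    """Reliability R(l^(k)|x_i) for each sample, following Eq. (8):
    E(l|x) = -log P(l|x); R = E(l^(k)|x) - min over the other labels of E(l|x).
    P: (n_samples, n_labels) class posterior matrix (e.g., from LIBSVM);
    lk: index of the cell's majorclass label."""
    E = -np.log(P)
    others = np.delete(E, lk, axis=1).min(axis=1)
    return E[:, lk] - others
```

A negative value means *l*^{(k)} is the most probable label for the sample; a positive value means some other label is more probable.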

Then, we sort *x*_{i} (*i*∈*C*_{k}) in ascending order of *R*(*l*^{(k)}|*x*_{i}) and denote the sorted sample sequence as \(\phantom {\dot {i}\!}{{x}_{{{\delta }_{i}}}}\), where *δ*_{i} is the index (or position) of *x*_{i} in the sorted sequence.

Then, we reweight *x*_{i} (*i*∈*C*_{k}) using the following reweighting formula,

$${{\omega}_{i}}=\left\{ \begin{array}{ll} 1, & {{\delta}_{i}}\le {{N}_{m}} \\ \theta, & {{N}_{m}}<{{\delta}_{i}}\le {{N}_{s}} \\ 0, & {{\delta}_{i}}>{{N}_{s}} \end{array} \right.\qquad(9)$$

where *N*_{s}=int(*p*_{k}|*C*_{k}|) is the number of majorclass (*l*^{(k)}) samples in *C*_{k}; \({{N}_{m}}=\frac {\left | {{C}_{k}} \right |}{M}\) is the number of samples to be assigned weight 1; and *θ* is a parameter controlling the degree of weight weakening for *N*_{m}<*δ*_{i}≤*N*_{s}. Typically, *θ*=0.5 works well in most cases. According to the formula, we assign strong weights (*ω*_{i}>0) to the first *N*_{s} samples with the smallest *R*(*l*^{(k)}|*x*_{i}) (where *i* satisfies *δ*_{i}≤*N*_{s}) and zero weight to the others. To further enhance the robustness of our method to the estimation error of the label proportion *p*_{k}, the sample weights introduce a penalty term for *N*_{m}<*δ*_{i}≤*N*_{s}, instead of simply assigning *ω*_{i}=1 for all *δ*_{i}≤*N*_{s}, as illustrated in Fig. 3.
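The per-cell reweighting rule can be sketched as follows (a hypothetical helper; the weights 1, *θ*, and 0 follow the rule of Eq. (9)):

```python
import numpy as np

def reweight(R, p_k, M, theta=0.5):
    """Rank the samples of one cell by reliability R (ascending) and assign
    weight 1 to the first N_m, theta to ranks N_m+1..N_s, and 0 beyond N_s.
    R: reliability values for the cell's samples; p_k: majorclass
    proportion; M: number of classes; theta: weakening factor."""
    n = len(R)
    Ns = int(p_k * n)                       # majorclass sample count in the cell
    Nm = n // M                             # samples guaranteed full weight
    delta = np.argsort(np.argsort(R)) + 1   # 1-based rank in sorted order
    return np.where(delta <= Nm, 1.0,
                    np.where(delta <= Ns, theta, 0.0))
```

The double `argsort` converts reliability values into ranks *δ*_{i} without explicitly rebuilding the sorted sequence.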

*N*_{s} is used to control the difference between the predicted and true label proportions by explicitly setting the number of predicted majorclass samples to that of the true majorclass samples in *C*_{k}. According to \({{\tilde {p}}_{k}}(\omega)\) defined before, we have

$${{\tilde{p}}_{k}}(\omega)=\frac{1}{|{{C}_{k}}|}\sum\limits_{i\in {{C}_{k}}}{[{{\omega}_{i}}>0]}=\frac{{{N}_{s}}}{|{{C}_{k}}|}$$

With *N*_{s}=int(*p*_{k}|*C*_{k}|), we get \({{\tilde {p}}_{k}}(\omega)\approx {{p}_{k}}\). Then \({{L}_{p}}=|{{\tilde {p}}_{k}}(\omega)-{{p}_{k}}|\approx 0\), which means the loss function in Eq. (7) is minimized to its first term, i.e., the weighted sum of the loss values of the chosen *N*_{s} samples. So we have

$$\underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min}}\,\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L\left({{x}_{i}}\right)}$$

where we use *L*(*x*_{i}) to denote *L*(*y*_{i},**w**^{T}*φ*(*x*_{i})+*b*) for short.

Let *ω*^{′}_{i} be the updated version of sample weight *ω*_{i} using the reweighting strategy in Eq. (9). Then, *ω*^{′}_{i} is decreasing (non-increasing) with *δ*_{i} (see Fig. 3). Recall that \(\phantom {\dot {i}\!}\{{{x}_{{{\delta }_{i}}}}\}\) is the sample sequence after sorting {*x*_{i}} in ascending order of the potential \(R\left ({{l}^{(k)}}|{{x}_{{\delta }_{i}}}\right)\), which means \(R\left ({{l}^{(k)}}|{{x}_{{\delta }_{i}}}\right)\) is increasing (non-decreasing) with *δ*_{i}. For a sample \(\phantom {\dot {i}\!}{{x}_{{{\delta }_{i}}}}\), a smaller value of the potential \(\phantom {\dot {i}\!}R\left ({{l}^{(k)}}|{{x}_{{{\delta }_{i}}}}\right)\) indicates a greater probability of *l*^{(k)} being the true label of \(\phantom {\dot {i}\!}{{x}_{{\delta }_{i}}}\), which also means a smaller sample cost \(\phantom {\dot {i}\!}L({{x}_{{\delta }_{i}}})\) in Eq. (7). So \(\phantom {\dot {i}\!}L({{x}_{{\delta }_{i}}})\) is also increasing (non-decreasing) with *δ*_{i}.

By the rearrangement inequality, the weighted sum \(\sum \nolimits _{i}{{{\omega }_{i}}L({{x}_{i}})}\) is minimized when the sequences {*ω*_{i}} and {*L*(*x*_{i})} are in opposite orders. Since {*ω*^{′}_{i}} is in descending order of *δ*_{i} and \(\phantom {\dot {i}\!}\{L({{x}_{{{\delta }_{i}}}})\}\) is in ascending order of *δ*_{i}, we have \(\underset {\{{{\omega }_{i}}|i\in {{C}_{k}}\}}{\mathop {\min }}\,\sum \limits _{i\in {{C}_{k}}}^{{}}{{{\omega }_{i}}L\left ({{x}_{i}} \right)}=\sum \limits _{i\in {{C}_{k}}}{{{{\omega }'}_{i}}L({{x}_{{{\delta }_{i}}}})}\). Finally, Eq. (7) is minimized as

$$\underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min}}\,\lambda\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L\left({{x}_{i}}\right)}+{{\lambda}_{p}}{{L}_{p}}\left({{\tilde{p}}_{k}}(\omega),{{p}_{k}}\right)\approx \lambda\sum\limits_{i\in {{C}_{k}}}{{{{\omega}'}_{i}}L\left({{x}_{{{\delta}_{i}}}}\right)}$$

which means the cost function in Eq. (7) is minimized by assigning the higher weights *ω*^{′}_{i} to the samples with lower cost \(\phantom {\dot {i}\!}L({{x}_{{\delta }_{i}}})\) in our LpcSVM algorithm.

Besides, we choose \({{N}_{m}}=\frac {\left | {{C}_{k}} \right |}{M}\) in Eq. (9) for the following reason. As there are *M* labels for classification, the infimum of the majorclass proportion *p*_{k} is \(\inf ({{p}_{k}})=\frac {1}{M}\), so \(\inf ({{N}_{s}})=\inf ({{p}_{k}})\times \left | {{C}_{k}} \right |=\frac {\left | {{C}_{k}} \right |}{M}\). Despite possible noise in *p*_{k}, at least \({{N}_{m}}=\frac {\left | {{C}_{k}} \right |}{M}\) samples in each cell should be made full use of by assigning them weight 1. A larger value of *δ*_{i} means a less reliable sample. In our reweighting strategy, weakened weights are assigned to the less reliable samples (i.e., *N*_{m}<*δ*_{i}≤*N*_{s}) among the *N*_{s} samples chosen for the next iteration (see Fig. 3). Then, small fluctuations of *N*_{s} will not cause considerable changes in the weight distribution, making the method more robust to possible estimation errors in the label proportions. The robustness of our method to estimation errors in the label proportions is evaluated experimentally in Section 5.4.

By combining label agreement [17] and label proportion similarity between the predicted and true labels, our reweighting strategy will not only weaken the role of unreliable samples in the training set *T* but also keep the important role of a certain number of reliable samples within the constraints of label proportions for the major class in each cell.

The updated *ω* is used to retrain the SVM classifier. After a number of iterations, strong weights will be assigned to the reliable samples and weak weights to the others, producing an optimized pixel-level training set (in the form of optimized sample weights). Finally, the classifier predicts the label for each sample in the test data using the optimized model parameters.

Although partially inspired by the idea of ∝SVM [11], our model is essentially different from ∝SVM in the following respects: (1) ∝SVM works on multiclass label proportions, while our model works on the majorclass label proportion, which makes sample labeling more efficient; (2) ∝SVM only deals with binary classification, while our model deals directly with multiclass classification; and (3) our model is optimized through sample reweighting, which is entirely different from ∝SVM and also makes multiclass classification realizable.

## 5 Experiments

### 5.1 Dataset description and experiment setting

**Table 1** Datasets used to test the methods

| Site | Imaging mode | Pixel posting | Size (pixels) |
|---|---|---|---|
| Tianjin | Stripmap | 2.0 m | 8911 × 8787 |
| Rosenheim | Spotlight | 1.25 m | 9504 × 8330 |

The SVM classifier is configured with a radial basis function (RBF) kernel and probability estimation [16]. In our experiments, the parameter *λ* in Eq. (7) is tuned over {0.1, 1.0, 10}, and we found that *λ*=1.0 works best. The parameter *λ*_{p} in Eq. (7) is tuned over {*λ*, 10*λ*, 100*λ*, 1000*λ*}; a larger value of *λ*_{p} means a stronger penalty from the label proportions. Theoretically, the parameter *θ* in Eq. (9) controls the penalty on the less reliable training samples with *N*_{m}<*δ*_{i}≤*N*_{s}; a larger *θ* means assigning larger weights to the corresponding samples. We tune *θ* over {0.3, 0.4, 0.5, 0.6, 0.7} and found that *θ*=0.5 works best. We use efficient low-level features, i.e., backscattering intensity, texture, and supertexture [1], as inputs to the SVM classifier. The features are defined as follows.

The intensity feature \(f_{i}^{I}\) is the backscattering intensity of pixel *i*. Suppose we define a patch *i* centered at pixel *i*, which is a local region at pixel *i* for texture measurement. Both the texture and supertexture features are extracted on patches. The texture feature \(f_{i}^{tex}\) represents the coefficient of variation in a patch, that is, \(f_{i}^{tex}=\sigma _{i}^{tex}/\mu _{i}^{tex}\), where \(\mu _{i}^{tex}\) gives the mean backscattering intensity and \(\sigma _{i}^{tex}\) the standard deviation of the backscattering intensity over all the pixels in patch *i*. In our test, we adopt a patch of 11 × 11 pixels, which covers a block with homogeneous texture (e.g., a residential building) at the resolution of our test images (around 1.5 m). The supertexture feature \(f_{i}^{sup}\) measures the similarity of textures between neighboring patches, thus providing clues about the texture context of neighboring patches. For patch *i*, supertexture is defined as

$$f_{i}^{sup}=\frac{\sigma _{i}^{sup}}{\mu _{i}^{sup}},\quad \mu _{i}^{sup}=\frac{1}{n}\sum\limits_{j\in N(i)}{f_{j}^{tex}},\quad \sigma _{i}^{sup}=\sqrt{\frac{1}{n}\sum\limits_{j\in N(i)}{{{\left(f_{j}^{tex}-\mu _{i}^{sup}\right)}^{2}}}}$$

where \(\mu _{i}^{sup}\) is the average and \(\sigma _{i}^{sup}\) the standard deviation of \(f_{j}^{tex}\) over the neighborhood *N*(*i*) of patch *i*, and *n* is the number of patches in *N*(*i*). A 5 × 5 rectangular neighborhood is used in our work for an appropriate measurement of scene homogeneity and texture context (such as a piece of residential area or green field). Then, the feature vector \({{f}_{i}}={{\left [f_{i}^{I},f_{i}^{tex},f_{i}^{sup}\right ]}^{T}}\) for each pixel *x*_{i} is taken as input for the SVM classifier.
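A per-pixel feature extraction sketch follows. It is a simplification of the description above: both texture and supertexture are computed with pixel-centered sliding windows (the supertexture over the texture map rather than over disjoint patches), which is our assumption rather than the paper's exact procedure:

```python
import numpy as np

def local_cv(img, size):
    """Coefficient of variation (std/mean) over size x size windows,
    computed per pixel with edge padding."""
    h = size // 2
    pad = np.pad(img, h, mode='edge')
    win = np.lib.stride_tricks.sliding_window_view(pad, (size, size))
    mu = win.mean(axis=(2, 3))
    sd = win.std(axis=(2, 3))
    return sd / np.maximum(mu, 1e-12)   # guard against zero-mean windows

def sar_features(intensity, patch=11, neigh=5):
    """Per-pixel feature vectors [f^I, f^tex, f^sup]: intensity, texture
    (CV of intensity), and supertexture (CV of the texture map)."""
    f_tex = local_cv(intensity, patch)
    f_sup = local_cv(f_tex, neigh)
    return np.stack([intensity, f_tex, f_sup], axis=-1)
```

On a homogeneous (constant) image, both texture and supertexture are zero, matching the intuition that they measure local variation.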

We have designed four groups of experiments to test our method from different aspects. The first shows the overall experimental results by comparing our method to the baseline methods. The second analyzes the classification performance of our method with varying cell size and iteration number. The third explores the robustness of our method to *p*_{k} as estimated in majorclass GL. The last applies our method to the classification of a simulated dataset. Besides, we discuss the efficiency and cost advantages of our method over traditional methods at the end of this section.

### 5.2 Overall experimental results

**Table 2** Overall classification accuracy for the SAR images (%)

| Method | Tianjin overall accuracy | Tianjin Kappa coefficient | Rosenheim overall accuracy | Rosenheim Kappa coefficient |
|---|---|---|---|---|
| PL+SVM | 87.41 ± 0.43 | 0.7762 ± 0.0132 | 68.17 ± 0.59 | 0.5880 ± 0.0091 |
| GL+SVM | 83.13 ± 0.47 | 0.6698 ± 0.0104 | 64.52 ± 0.61 | 0.4623 ± 0.0324 |
| ∝SVM | 84.86 ± 0.53 | 0.7144 ± 0.0121 | 65.92 ± 0.95 | 0.4847 ± 0.0083 |
| GL+LpcSVM with | 86.38 ± 0.68 | 0.7278 ± 0.0131 | 67.34 ± 0.55 | 0.5728 ± 0.0079 |
| GL+LpcSVM with | 86.33 ± 0.74 | 0.7226 ± 0.0143 | 67.14 ± 0.53 | 0.5719 ± 0.0077 |

As can be seen from Table 2, GL+SVM does not perform well due to the uncertain labels introduced by grid labeling, while ∝SVM performs better than GL+SVM by learning from label proportions. Our method (GL+LpcSVM, with and without noise *n*_{k} in *p*_{k}) outperforms ∝SVM in classification accuracy by about 1.5 percentage points. Moreover, ∝SVM cannot handle multiclass classification directly, which means extra computation is needed for it to do so, and ∝SVM relies on detailed label proportions, which also implies more sample labeling cost than LpcSVM (based on majorclass GL). Although it is not entirely fair to compare our method with PL+SVM, as the latter is based on an accurate training set from pixel labeling, we still take PL+SVM as a baseline method. GL+LpcSVM (with and without noise *n*_{k} in *p*_{k}) demonstrates closely comparable classification accuracy and Kappa coefficient to PL+SVM, while reducing the labeling cost significantly by adopting grid labeling.

### 5.3 Classification performance with varying cell size and iteration number

### 5.4 Robustness to *p*_{k} estimated in majorclass GL

*p*

_{ k }. As we have introduced a penalty of weights at the end of the sequence of

*N*

_{ s }samples chosen for the next training in the reweighting formula of Eq. (9), small errors in

*p*

_{ k }will not degrade the performance of our method. Influence of label proportion errors on classification accuracy for the two SAR images is presented in Fig. 7. The test is performed with a cell size of 200 ×200 and four iterations. We simulate the errors in label proportion by adding noise

*n*

_{ k }of a normal distribution with mean

*μ*=0 and standard deviation

*σ*=0.05∼0.20 to the existing label proportion

*p*

_{ k }. Then, the noisy label proportion is

*p*

_{ k }=

*p*

_{ k }+

*n*

_{ k }, where

*n*

_{ k }∼

*N*(

*μ*,

*σ*

^{2}). According to the three-sigma rule, the value of

*n*

_{ k }will basically fall in [

*μ*−3

*σ*,

*μ*+3

*σ*], e.g., [-0.15, 0.15] for

*σ*=0.05. As can be seen from the figure, our method still maintains a relative high level of classification accuracy under the influence of errors in label proportions.

To estimate the standard deviation of the actual label proportion errors, grid labeling is performed on nearly 4000 cell regions from the two full-scene TerraSAR-X images used in the experiment. Five persons participate in the grid labeling stage, and each person labels each image twice, with the order of cell regions randomly regenerated for each pass. We thus obtain ten groups of different labelings for the cell regions in each image. Based on these labelings, we find that the standard deviation of the label proportion errors in grid labeling is basically around *σ*=0.05, which means (by the three-sigma rule above) that the label proportion errors fall in the interval [-0.15, 0.15] with a probability of 99.7%. Moreover, our method still provides desirable classification results (85.92% for Tianjin and 67.06% for Rosenheim) even under noise with *σ*=0.10 (see Fig. 7).
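This repeated-labeling estimate of *σ* can be reproduced with synthetic data. The data layout (one major-class proportion per cell per labeling pass) and the ground-truth proportions below are hypothetical, chosen only to match the setup described above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: ~4000 cell regions, 10 independent labeling passes
# (5 persons x 2 passes each), proportions perturbed with sigma = 0.05 noise.
n_cells, n_passes, sigma_true = 4000, 10, 0.05
true_p = rng.uniform(0.5, 1.0, size=n_cells)
labelings = true_p + rng.normal(0.0, sigma_true, size=(n_passes, n_cells))

# Per-cell standard deviation across passes, averaged over all cells, gives
# an estimate of the label-proportion error level reported in the paper.
sigma_hat = labelings.std(axis=0, ddof=1).mean()
```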

Besides, if no label proportions are provided (i.e., *p*_{k}=1, *k*=1,…,*K*, as defined in Naive GL), our method still works relatively well, outperforming GL+SVM by about 1% in classification accuracy (see the case of *σ*= max in Fig. 7). In this condition, all the samples in cell *C*_{k} are selected for the next iteration (i.e., *N*_{s}=|*C*_{k}|), but they are reweighted differently according to the reliability of their labels, without the support of label proportions.

### 5.5 Classification with the simulated dataset

The dataset of a simulated SAR image is also used to further test our method. To simulate a SAR image of a large scene for classification, we adopt a widely used speckle-statistics-based simulation method, as in Section 4.1.2 of [24]. Four types of homogeneous areas are considered, representing four land cover types in the simulated SAR image, and the pixels in each type of homogeneous area follow a Rayleigh distribution. We implement each type of homogeneous area by setting the real and imaginary components of each pixel in the area to follow independent and identical Gaussian (normal) distributions with zero mean and variance *σ*^{2}. The amplitude *A* of the pixels then follows the Rayleigh probability density \(p(A)=\frac {A}{{\sigma }^{2}}\exp \left (-\frac {{A}^{2}}{2{{\sigma }^{2}}} \right),A\ge 0\), with mean \(\sigma \sqrt {\frac {\pi }{2}}\) and variance \(\frac {4-\pi }{2}{{\sigma }^{2}}\).

We set *σ*=50, 110, 130, 150, respectively, to generate the four types of homogeneous areas (from dark to bright). The truth map of the TerraSAR-X image of Rosenheim is used for the spatial configuration (locations, sizes, and shapes) of the different homogeneous areas in the simulated image. Finally, we obtain a simulated image of size 9504 × 8330 containing four types of homogeneous areas that represent four different land cover types.
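A minimal sketch of this simulation step, assuming NumPy and one homogeneous patch per call (the patch size and random seed are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_homogeneous_area(sigma, shape):
    """Draw i.i.d. zero-mean Gaussian real/imaginary components with
    variance sigma^2; the amplitude A = |re + j*im| is then Rayleigh
    distributed with mean sigma*sqrt(pi/2) and variance (4-pi)/2*sigma^2."""
    re = rng.normal(0.0, sigma, size=shape)
    im = rng.normal(0.0, sigma, size=shape)
    return np.hypot(re, im)  # amplitude A = sqrt(re^2 + im^2)

# The four homogeneous area types (dark to bright), as in the experiment:
areas = {s: simulate_homogeneous_area(s, (512, 512))
         for s in (50, 110, 130, 150)}
mean_dark = areas[50].mean()  # close to 50 * sqrt(pi/2), about 62.67
```

Larger *σ* yields brighter areas, so the full-scene image is assembled by filling each truth-map region with the patch type of the corresponding land cover class.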

Overall classification accuracy for the simulated image (%)

Method | Overall accuracy | Kappa coefficient |
---|---|---|
PL+SVM | 89.32 ± 0.34 | 0.8385 ± 0.0102 |
GL+SVM | 86.83 ± 0.42 | 0.8011 ± 0.0154 |
∝SVM | 87.03 ± 0.36 | 0.8056 ± 0.0107 |
GL+LpcSVM without noise *n*_{k} in *p*_{k} | 89.08 ± 0.44 | 0.8346 ± 0.0125 |
GL+LpcSVM with noise *n*_{k} in *p*_{k} | 88.97 ± 0.52 | 0.8331 ± 0.0121 |

### 5.6 Efficiency and cost advantages

As mentioned above, the classification accuracy of our method (GL+LpcSVM) is slightly lower than that of PL+SVM, as the latter is based on an accurate, pixel-labeled training set. But we emphasize the efficiency and cost advantages of our method over PL+SVM. For efficiency measurement, we consider the total classification time *t* as the sum of the training set definition time *t*
_{1} and the running time *t*
_{2} of the SVM classifier. Even using polygon annotation, the efficiency of pixel labeling remains very low, owing to the various scales and shapes of the polygons, the unclear criterion for how small a region may be before it can be ignored, and regions with uncertain labels. As shown below, the efficiency of classification can be improved significantly through the great reduction of the training set definition time *t*
_{1} achieved by our method.

For PL+SVM, *t*_{1} is the time of pixel labeling; for GL+LpcSVM, it is the time of grid labeling using the developed labeling tool. From the table, we can see that *t*_{1} is much larger than *t*_{2} for both methods. Our method greatly reduces the total classification time *t* by greatly decreasing the training set definition time *t*_{1}, while the LpcSVM inference algorithm reaches stability within just three or four iterations (see Fig. 6), meaning only a slightly increased *t*_{2}.

Average time used for classification of the Tianjin and Rosenheim areas (min)

Method | Training set definition time | Classifier running time | Total time |
---|---|---|---|
PL+SVM | 514 | 22 | 536 |
GL+LpcSVM | 128 | 37 | 165 |

In addition, the cost of land cover classification has been reduced significantly with the reduction of training set definition time. More importantly, expensive field surveys can be avoided for sample labeling, making our method suitable for practical large-scale SAR image understanding.

## 6 Conclusions

This paper presents an efficient SAR image classification method based on learning from label proportions. Grid labeling is introduced to obtain the truth of training data more efficiently. To eliminate the label uncertainty arising from grid labeling, we present an SVM-based model for learning from label proportions, together with an approximate inference algorithm built on a reweighting strategy that considers both label agreement and label proportions. Our method not only outperforms ∝SVM, a state-of-the-art approach to learning from label proportions, but also demonstrates accuracy comparable to traditional classification methods based on pixel labeling, while greatly reducing the labeling cost for training data. Future work includes developing an automatic estimation method for label proportions to further reduce human labor in sample labeling, and extending our method to more application scenarios, such as SAR urban LULC mapping [3, 14] and multispectral image classification.

## Declarations

### Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61331015 and U1406404.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Y Ding, Y Li, W Yu, SAR image classification based on CRFs with integration of local label context and pairwise label compatibility. IEEE J. Selected Topics Appl. Earth Observ. Remote Sensing. **7**(1), 300–306 (2014). doi:10.1109/JSTARS.2013.2262038
- JA Richards, X Jia, *Remote sensing digital image analysis: an introduction* (Springer, Berlin, 2006)
- P Gamba, M Aldrighi, SAR data classification of urban areas by means of segmentation techniques and ancillary optical data. IEEE J. Selected Topics Appl. Earth Observ. Remote Sensing. **5**(4), 1140–1148 (2012). doi:10.1109/JSTARS.2012.2195774
- C Tison, F Tupin, H Maitre, A fusion scheme for joint retrieval of urban height map and classification from high-resolution interferometric SAR images. IEEE Trans. Geosci. Remote Sensing. **45**(2), 496–505 (2007). doi:10.1109/TGRS.2006.887006
- S Rajan, J Ghosh, MM Crawford, Exploiting class hierarchies for knowledge transfer in hyperspectral data. IEEE Trans. Geosci. Remote Sensing. **44**(11), 3408–3417 (2006). doi:10.1109/TGRS.2006.878442
- C Persello, L Bruzzone, Active learning for domain adaptation in the supervised classification of remote sensing images. IEEE Trans. Geosci. Remote Sensing. **50**(11), 4468–4483 (2012). doi:10.1109/TGRS.2012.2192740
- C Persello, Interactive domain adaptation for the classification of remote sensing images using active learning. IEEE Geosci. Remote Sensing Lett. **10**(4), 736–740 (2013). doi:10.1109/LGRS.2012.2220516
- B Demir, F Bovolo, L Bruzzone, Classification of time series of multispectral images with limited training data. IEEE Trans. Image Process. **22**(8), 3219–3233 (2013). doi:10.1109/TIP.2013.2259838
- N Quadrianto, AJ Smola, TS Caetano, QV Le, Estimating labels from label proportions. J. Mach. Learn. Res. **10**, 2349–2374 (2009)
- S Rueping, in *Proceedings of the 27th International Conference on Machine Learning*. SVM classifier estimation from group probabilities (ACM, Haifa, 2010), pp. 911–918
- F Yu, D Liu, S Kumar, T Jebara, S Chang, in *Proceedings of the 30th International Conference on Machine Learning*. ∝SVM for learning with label proportions (ACM, Atlanta, 2013), pp. 504–512
- G Patrini, R Nock, T Caetano, P Rivera, in *Advances in Neural Information Processing Systems 27*, ed. by Z Ghahramani, M Welling, C Cortes, ND Lawrence, KQ Weinberger. No label no cry (Curran Associates, Inc., North Miami Beach, 2014), pp. 190–198
- SS Keerthi, S Sundararajan, S Shevade, in *Proceedings of the 24th International Conference on Computational Linguistics*. Extension of TSVM to multi-class and hierarchical text classification problems with general losses (Association for Computational Linguistics, Mumbai, 2012), pp. 1091–1100
- Y Ding, L Qiu, P Yang, Z Zhou, Y Li, W Yu, in *2013 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)*. Scene scattering descriptor for urban classification in very high resolution SAR images (IEEE, Melbourne, 2013), pp. 2015–2018. doi:10.1109/IGARSS.2013.6723205
- C Cortes, V Vapnik, Support-vector networks. Mach. Learn. **20**(3), 273–297 (1995). doi:10.1007/BF00994018
- C-C Chang, C-J Lin, LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. **2**(3), 1–27 (2011). doi:10.1145/1961189.1961199
- W Dai, Q Yang, G-R Xue, Y Yu, in *Proceedings of the 24th International Conference on Machine Learning*. Boosting for transfer learning (ACM, Corvallis, 2007), pp. 193–200. doi:10.1145/1273496.1273521
- Y Freund, RE Schapire, A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. **55**(1), 119–139 (1997). doi:10.1006/jcss.1997.1504
- ZH Zhou, *Ensemble methods: foundations and algorithms* (Taylor & Francis, New York, 2012)
- G Jun, J Ghosh, in *2008 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)*, vol. 1. An efficient active learning algorithm with knowledge transfer for hyperspectral data analysis (IEEE, Boston, 2008), pp. 52–55. doi:10.1109/IGARSS.2008.4778790
- T-F Wu, C-J Lin, RC Weng, Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res. **5**, 975–1005 (2004)
- GH Hardy, JE Littlewood, G Pólya, *Inequalities, Cambridge Mathematical Library. Section 10.2, Theorem 368* (Cambridge University Press, Cambridge, 1952)
- D Dai, W Yang, H Sun, Multilevel local pattern histogram for SAR image classification. IEEE Geosci. Remote Sensing Lett. **8**(2), 225–229 (2011). doi:10.1109/LGRS.2010.2058997
- JS Lee, E Pottier, *Polarimetric radar imaging: from basics to applications* (CRC Press, Boca Raton, 2009)