In this section, we explore a SAR image classification method that learns from majorclass GL. The main reason we choose majorclass GL for sample labeling is its good balance between labeling efficiency and the integration of label priors. Majorclass GL incorporates the label proportion of the major class in a cell, which is the most important among all the labels. Although multiclass GL provides more information than majorclass GL, estimating the proportion of every class during sample labeling is much harder than estimating only that of the major class. Moreover, to solve a learning problem constrained by multiclass proportions, the optimization algorithm must satisfy the proportion of every label in the label set for each cell, whereas a problem based on majorclass proportions only requires satisfying the proportion of the majorclass label in each cell. Solving a learning problem under multiclass proportion constraints is therefore considerably more difficult than under majorclass proportion constraints.
4.1 The learning model with label proportions
We build a learning model with label proportions based on the SVM classifier [16] and a boosting strategy [17]. The SVM classifier takes the simple but efficient features (i.e., backscattering intensity, texture, and supertexture) proposed in [1] as its inputs and outputs the land cover types of the SAR images. The training set obtained from grid labeling is optimized with the reweighting strategy of the boosting algorithm [17, 18], which we improve by introducing a new penalty term for label proportions.
We formulate the SVM-based model under the large-margin framework as below [7, 11],
$$\begin{array}{*{20}l} & {}\underset{\mathbf{w},b,\omega}{\mathop{\min}}\,\!\frac{1}{2}{{\mathbf{w}}^{T}}\mathbf{w}\,+\,\lambda\! \sum\limits_{i=1}^{N}\!{{{\omega}_{i}}L\!\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi({{x}_{i}})\,+\,b \right)} \,+\,\!{{\lambda}_{p}}\!\sum\limits_{k=1}^{K}\!{{{L}_{p}}\!\left({{{\tilde{p}}}_{k}}\!(\omega)\!,{{p}_{k}} \!\right)} \\ & {}\mathrm{s}.\text{t}.\;{{\omega}_{i}}\in [\!0,1],i=1,\ldots,N \end{array} $$
(4)
where \(\omega_i\) is the weight for training sample \(x_i\); \(\omega=[\omega_1,\ldots,\omega_N]^T\) is the weight vector; \(y_i\) is the true label of \(x_i\) from grid labeling; \(L(\cdot)\ge 0\) is the loss function of the traditional SVM model; another loss function \(L_p(\cdot)\ge 0\) is designed to penalize the dissimilarity between the true label proportion \(p_k\) and the predicted label proportion \({{\tilde{p}}_{k}}\) based on \(\omega\); we define \({{\tilde{p}}_{k}}(\omega)=\frac{1}{|{{C}_{k}}|}\sum\limits_{i\in {{C}_{k}}}{[{{\omega}_{i}}>0]}\) because a nonzero value is assigned to \(\omega_i\) if the majorclass label \(l^{(k)}\) is reliable enough to be the true label of \(x_i\) (\(i\in C_k\)), i.e., \(y_i=l^{(k)}\); \(\varphi(x_i)\) maps \(x_i\) into a higher-dimensional space; and \(\lambda\) and \(\lambda_p\) are the regularization parameters. Our task is to simultaneously optimize the weight \(\omega\) and the SVM model parameters \(\mathbf{w}\) and \(b\) so that \({{\tilde{p}}_{k}}\) estimated from \(\omega\) is close enough to \(p_k\).
Based on pixel-level training and label proportion constraints from grid labeling, this model not only implements pixel-level classification for SAR images but also improves the efficiency of sample labeling and reduces the corresponding cost. We build the model on SVM because the training set can be optimized by reweighting the importance of each training sample in the SVM model, a strategy that has been adopted in transfer learning methods [7, 17]. Moreover, the number of classes can easily be extended by defining a training set with richer classes for the SVM classifier.
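To make the roles of the two loss terms in Eq. (4) concrete, the following minimal sketch (ours, purely illustrative) evaluates the objective for fixed \((\mathbf{w}, b, \omega)\), assuming a linear kernel, a binary hinge loss, and the absolute proportion loss later used in Eq. (7); the names `cells`, `proportions`, `lam`, and `lam_p` are assumptions, not part of the original formulation.

```python
import numpy as np

def lpc_objective(w, b, omega, X, y, cells, proportions, lam, lam_p):
    """Evaluate the objective of Eq. (4) for fixed (w, b, omega).

    Sketch only: assumes a linear kernel (phi = identity), a binary hinge
    loss with y_i in {-1, +1}, and L_p(p~_k, p_k) = |p~_k - p_k|.
    `cells[k]` holds the sample indices C_k and `proportions[k]` holds p_k.
    """
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))   # L(y_i, w^T phi(x_i) + b)
    data_term = lam * np.sum(omega * hinge)          # weighted SVM loss
    prop_term = sum(abs(np.mean(omega[idx] > 0) - proportions[k])
                    for k, idx in cells.items())     # sum_k L_p(p~_k(omega), p_k)
    return 0.5 * float(w @ w) + data_term + lam_p * prop_term
```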
4.2 Model inference using LpcSVM algorithm
The unknown sample weight \(\omega\) can be seen as a link between the supervised learning loss \(L(\cdot)\) and the label proportion loss \(L_p(\cdot)\) in Eq. (4). Therefore, Eq. (4) can be solved with an alternating optimization strategy [11, 19], i.e., optimizing \((\mathbf{w}, b)\) and \(\omega\) one at a time while fixing the other. For fixed sample weights \(\omega=[\omega_1,\ldots,\omega_N]^T\), the optimization of Eq. (4) w.r.t. \(\mathbf{w}\) and \(b\) becomes a classic SVM problem. When \(\mathbf{w}\) and \(b\) are fixed, \(\omega\) can be updated according to label agreement and the label proportion constraints. The baseline strategy for updating the weight distribution is to assign lower weights to misclassified or unreliable samples in the training set [20]. Assuming that the majority of samples in the training set are correctly labeled by grid labeling, correctly classified (reliable) training samples receive strong weights and misclassified samples receive weak weights in the reweighting process. An updated \(\omega\) yields an updated classifier, which is expected to be more trustworthy than the previous ones because it is trained with more reliable samples.
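A minimal Python skeleton of this alternating scheme is given below, using the LIBSVM-backed `SVC` from scikit-learn for the weighted SVM step. It is a sketch rather than the authors' implementation: the helpers `potential` and `update_cell_weights` are hypothetical and are sketched after Eqs. (8) and (9) below, and all variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def lpcsvm(X, y, cells, proportions, n_classes, n_iter=10, C=1.0, theta=0.5):
    """Alternating optimization in the spirit of the LpcSVM algorithm.

    `y` holds the majorclass labels l^(k) assigned to every pixel by grid
    labeling, `cells[k]` the sample indices C_k, and `proportions[k]` the
    majorclass proportion p_k.  The helpers `potential` (Eq. (8)) and
    `update_cell_weights` (Eq. (9)) are sketched later in this section.
    """
    omega = np.ones(len(y))                            # initialize omega_i = 1
    clf = None
    for _ in range(n_iter):
        clf = SVC(C=C, probability=True)               # LIBSVM-backed classifier
        clf.fit(X, y, sample_weight=omega)             # Eq. (5): fix omega, solve for (w, b)
        proba = clf.predict_proba(X)                   # class posteriors P(l | x_i)
        for k, idx in cells.items():                   # Eq. (7): fix (w, b), update omega cell-wise
            col = int(np.where(clf.classes_ == y[idx[0]])[0][0])
            R = potential(proba[idx], col)             # reliability of l^(k) for each sample
            omega[idx] = update_cell_weights(R, proportions[k], n_classes, theta)
    return clf, omega
```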
The alternating optimization procedure of Eq. (4) is summarized in Algorithm 1, which is called the label-proportion-constrained SVM (LpcSVM) algorithm. By initializing \(\omega_i=1\) (\(i=1,\ldots,N\)), we obtain \(\mathbf{w}\) and \(b\) through the traditional SVM training,
$$\begin{array}{@{}rcl@{}} \underset{\mathbf{w},b}{\mathop{\min}}\,\frac{1}{2}{{\mathbf{w}}^{T}}\mathbf{w}+\lambda \sum\limits_{i=1}^{N}{{{\omega}_{i}}L\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi({{x}_{i}})+b \right)} \end{array} $$
(5)
In Eq. (5), we set \(y_i=l^{(k)}\) for \(x_i\) (\(i\in C_k\)) because the major class is used to label all the samples in \(C_k\) during the training process. When \(\mathbf{w}\) and \(b\) are obtained, the problem of Eq. (4) becomes
$$\begin{array}{*{20}l} & \underset{\omega}{\mathop{\min}}\,\sum\limits_{i=1}^{N}{{{\omega}_{i}}L\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi({{x}_{i}})+b \right)}+\frac{{{\lambda}_{p}}}{\lambda}\sum\limits_{k=1}^{K}{{{L}_{p}}\left({{{\tilde{p}}}_{k}}(\omega),{{p}_{k}} \right)} \\ & \mathrm{s}.\text{t}.\;{{\omega}_{i}}\in \ [\!0,1],i=1,\ldots,N \end{array} $$
(6)
The problem of Eq. (6) can be solved by optimizing \(\omega_i\) (\(i\in C_k\)) on each cell separately, as the influence of each cell \(C_k\) on the objective function in Eq. (6) is independent. This means Eq. (6) can be converted into a cell-wise optimization form,
$$\begin{array}{*{20}l} & \underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min}}\,\sum\limits_{i\in {{C}_{k}}}{{\omega}_{i}L}{\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi ({{x}_{i}})+b \right)}+\frac{{{\lambda}_{p}}}{\lambda}{{L}_{p}}\left({{\tilde{p}}_{k}}(\omega),{{p}_{k}}\right) \\ & \mathrm{s}.\text{t}. \ {{\omega}_{i}}\in \ [\!0,1],i\in {{C}_{k}} \end{array} $$
(7)
where the absolute loss \({{L}_{p}}\left({{\tilde{p}}_{k}}(\omega),{{p}_{k}}\right)=|{{\tilde{p}}_{k}}(\omega)-{{p}_{k}}|\) is used in this paper. We adopt an approximate cell-wise optimization method for estimating the weights \(\omega_i\) in Eq. (7).
First, we calculate the potential \(R(l^{(k)}|x_i)\), a measure of sample reliability or label agreement, that is, how likely \(x_i\) is to be classified as \(y_i=l^{(k)}\) by the classifier. It is defined as
$$\begin{array}{@{}rcl@{}} R\left({{l}^{(k)}}|{{x}_{i}}\right)=E\left({{l}^{(k)}}|{{x}_{i}}\right)-\underset{l\in {{\mathcal{L}}^{\backslash{{l}^{(k)}}}}}{\mathop{\text{min}}}\,\left(E(l|{{x}_{i}})\right), \end{array} $$
(8)
where \(E(l|x_i)=-\log(P(l|x_i))\). The class posterior distribution \(P(l|x_i)\) is estimated by LIBSVM [16], which uses the method in [21] to estimate class probabilities for classification problems. The basic idea of that method is to estimate pairwise (i.e., one-against-one) class probabilities from decision values (or cost values) and then solve an optimization problem to obtain the multi-class probabilities. Compared with the simple potential form \(E(l^{(k)}|x_i)\), the definition in Eq. (8) is expected to better depict the potential of \(x_i\) to be classified as \(l^{(k)}\), by measuring the distance between the potential of label \(l^{(k)}\) and the minimal value among all the other labels (i.e., the label most likely to replace \(l^{(k)}\)). A smaller value of \(R(l^{(k)}|x_i)\) means \(l^{(k)}\) is more reliable as the true label of \(x_i\).
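Given the class posteriors for the samples of one cell (e.g., an \(n \times M\) array of LIBSVM probability estimates), Eq. (8) can be sketched as follows; the function and argument names are our own assumptions:

```python
import numpy as np

def potential(proba, major_col):
    """Potential R(l^(k) | x_i) of Eq. (8) for every sample of one cell.

    `proba` is an (n_samples, n_classes) array of class posteriors
    P(l | x_i); `major_col` is the column of the majorclass label l^(k).
    """
    E = -np.log(np.clip(proba, 1e-12, None))   # E(l | x_i) = -log P(l | x_i)
    E_major = E[:, major_col]                  # potential of the majorclass label
    E_rest = np.delete(E, major_col, axis=1)   # potentials of all other labels
    return E_major - E_rest.min(axis=1)        # smaller R -> l^(k) more reliable
```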
Then, we sort \(x_i\) (\(i\in C_k\)) in ascending order of \(R(l^{(k)}|x_i)\) and denote the sorted sample sequence by \({{x}_{{{\delta}_{i}}}}\), where \(\delta_i\) is the index value (or position) of \(x_i\) in the sorted sequence.
Next, we update the sample weight for \(x_i\) (\(i\in C_k\)) using the following reweighting formula,
$$\begin{array}{@{}rcl@{}} {{\omega}_{i}}=\left\{\begin{array}{*{35}{l}} 1,{{\delta}_{i}}\le {{N}_{m}} \\ \exp \left(-\frac{{{({{\delta}_{i}}-{{N}_{m}})}^{2}}}{\theta {{\left| {{C}_{k}} \right|}^{2}}} \right),{{N}_{m}}<{{\delta}_{i}}\le {{N}_{s}} \\ 0,{{\delta}_{i}}>{{N}_{s}} \\ \end{array}\right. \end{array} $$
(9)
where \({{N}_{s}}=\text{int}({{p}_{k}}|{{C}_{k}}|)\) is the number of majorclass (\(l^{(k)}\)) samples in \(C_k\); \({{N}_{m}}=\frac{\left|{{C}_{k}}\right|}{M}\) is the number of samples to be assigned weight 1; and \(\theta\) is a parameter controlling the degree of weight weakening for \(N_m<\delta_i\le N_s\). Typically, \(\theta=0.5\) works well in most cases. According to the formula, we assign strong weights (\(\omega_i>0\)) to the first \(N_s\) samples with the smallest \(R(l^{(k)}|x_i)\) (i.e., those whose \(\delta_i\le N_s\)) and zero weight to the others. To further enhance the robustness of our method against estimation errors in the label proportion \(p_k\), the sample weights are designed with a penalty term for \(N_m<\delta_i\le N_s\), instead of simply assigning \(\omega_i=1\) for all \(\delta_i\le N_s\), as illustrated in Fig. 3.
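The reweighting rule of Eq. (9) can then be sketched as below, taking the potentials of Eq. (8) for one cell as input; the helper name and interface are illustrative assumptions rather than the authors' code:

```python
import numpy as np

def update_cell_weights(R, p_k, n_classes, theta=0.5):
    """Cell-wise reweighting of Eq. (9) for one cell C_k.

    `R` holds the potentials R(l^(k) | x_i) of the cell's samples, `p_k` is
    the majorclass proportion from grid labeling, and `n_classes` is M.
    """
    n = len(R)
    N_s = int(p_k * n)                    # number of majorclass samples, int(p_k |C_k|)
    N_m = n // n_classes                  # |C_k| / M samples kept at full weight (floored)
    order = np.argsort(R)                 # ascending potential: most reliable first
    delta = np.empty(n, dtype=int)
    delta[order] = np.arange(1, n + 1)    # delta_i: rank of x_i in the sorted sequence

    omega = np.zeros(n)                   # delta_i > N_s       -> weight 0
    omega[delta <= N_m] = 1.0             # delta_i <= N_m      -> weight 1
    mid = (delta > N_m) & (delta <= N_s)  # N_m < delta_i <= N_s -> weakened weight
    omega[mid] = np.exp(-((delta[mid] - N_m) ** 2) / (theta * n ** 2))
    return omega
```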
In Eq. (9), \(N_s\) is used to control the difference between the predicted and true label proportions by explicitly setting the number of predicted majorclass samples to the number of true majorclass samples in \(C_k\). According to \({{\tilde{p}}_{k}}(\omega)\) defined before, we have
$$\begin{array}{@{}rcl@{}} {{\tilde{p}}_{k}}(\omega)=\frac{1}{|{{C}_{k}}|}\sum\limits_{i\in {{C}_{k}}}{[\!{{\omega}_{i}}>0]}=\frac{{N}_{s}}{|{{C}_{k}}|}. \end{array} $$
(10)
The operator "int" rounds a real number down to the nearest integer. By setting \({{N}_{s}}=\text{int}({{p}_{k}}|{{C}_{k}}|)\), we get \({{\tilde{p}}_{k}}(\omega)\approx {{p}_{k}}\). Then \({{L}_{p}}=|{{\tilde{p}}_{k}}(\omega)-{{p}_{k}}|\approx 0\), which means the loss function in Eq. (7) reduces to its first term, i.e., the weighted sum of the loss values of the chosen \(N_s\) samples. So we have
$$\begin{array}{*{20}l} & \underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min}}\,\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L\left({{x}_{i}}\right)}+\frac{{{\lambda}_{p}}}{\lambda}{{L}_{p}}\left({{{\tilde{p}}}_{k}}(\omega),{{p}_{k}} \right) \\ & =\underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min}}\,\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L\left({{x}_{i}}\right)} \end{array} $$
(11)
where we use \(L(x_i)\) to denote \(L\left({{y}_{i}},{{\mathbf{w}}^{T}}\varphi({{x}_{i}})+b\right)\) for short.
Let \({{{\omega}'}_{i}}\) be the sample weight \(\omega_i\) updated with the reweighting strategy in Eq. (9). Then \({{{\omega}'}_{i}}\) is decreasing (non-increasing) in \(\delta_i\) (see Fig. 3). Recall that \(\{{{x}_{{{\delta}_{i}}}}\}\) is the sample sequence obtained by sorting \(\{x_i\}\) in ascending order of the potential \(R\left({{l}^{(k)}}|{{x}_{{\delta}_{i}}}\right)\), which means \(R\left({{l}^{(k)}}|{{x}_{{\delta}_{i}}}\right)\) is increasing (non-decreasing) in \(\delta_i\). For a sample \({{x}_{{{\delta}_{i}}}}\), a smaller value of the potential \(R\left({{l}^{(k)}}|{{x}_{{{\delta}_{i}}}}\right)\) indicates a greater probability that \(l^{(k)}\) is the true label of \({{x}_{{\delta}_{i}}}\), which in turn means a smaller sample cost \(L({{x}_{{\delta}_{i}}})\) in Eq. (7). So \(L({{x}_{{\delta}_{i}}})\) is also increasing (non-decreasing) in \(\delta_i\).
According to the rearrangement inequality [22], \(\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L({{x}_{i}})}\) is minimized when the sequences \(\{\omega_i\}\) and \(\{L(x_i)\}\) are in opposite orders. Since \(\{{{{\omega}'}_{i}}\}\) is in descending order of \(\delta_i\) and \(\{L({{x}_{{{\delta}_{i}}}})\}\) is in ascending order of \(\delta_i\), we have \(\underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min}}\,\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L\left({{x}_{i}}\right)}=\sum\limits_{i\in {{C}_{k}}}{{{{\omega}'}_{i}}L({{x}_{{{\delta}_{i}}}})}\). Finally, Eq. (7) is minimized as
$$\begin{array}{*{20}l} & \underset{\{{{\omega}_{i}}|i\in {{C}_{k}}\}}{\mathop{\min }}\,\sum\limits_{i\in {{C}_{k}}}{{{\omega}_{i}}L\left({{x}_{i}} \right)}+\frac{{{\lambda}_{p}}}{\lambda}{{L}_{p}}\left({{{\tilde{p}}}_{k}}(\omega),{{p}_{k}} \right) \\ & =\sum\limits_{i\in {{C}_{k}}}{{{{\omega}'}_{i}}L\left({{x}_{{{\delta}_{i}}}} \right)} \end{array} $$
(12)
which means the cost function in Eq. (7) is minimized by assigning higher weights \({{{\omega}'}_{i}}\) to lower sample costs \(L({{x}_{{\delta}_{i}}})\) in our LpcSVM algorithm.
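For reference, the form of the rearrangement inequality [22] invoked above states that for real sequences \(a_1\le\cdots\le a_n\) and \(b_1\le\cdots\le b_n\) and any permutation \(\sigma\) of \(\{1,\ldots,n\}\),

$$\sum_{i=1}^{n}{a_i b_{n+1-i}} \le \sum_{i=1}^{n}{a_i b_{\sigma(i)}} \le \sum_{i=1}^{n}{a_i b_i},$$

so the weighted sum attains its minimum when the two sequences are paired in opposite orders, which is exactly how the non-increasing weights \({{{\omega}'}_{i}}\) are paired with the non-decreasing costs \(L({{x}_{{\delta}_{i}}})\) in Eq. (12).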
Besides, we choose \({{N}_{m}}=\frac{\left|{{C}_{k}}\right|}{M}\) in Eq. (9) for the following reason. Since there are \(M\) labels for classification, the infimum of \(p_k\) is \(\inf({{p}_{k}})=\frac{1}{M}\), so \(\inf({{N}_{s}})=\inf({{p}_{k}})\times\left|{{C}_{k}}\right|=\frac{\left|{{C}_{k}}\right|}{M}\). Hence, despite possible noise in \(p_k\), at least \({{N}_{m}}=\frac{\left|{{C}_{k}}\right|}{M}\) samples in each cell should be fully exploited by assigning weight 1 to them. A larger value of \(\delta_i\) indicates a less reliable sample, and in our reweighting strategy weakened weights are assigned to the less reliable ones (i.e., \(N_m<\delta_i\le N_s\)) among the \(N_s\) samples chosen for the next iteration (see Fig. 3). For example, with \(|C_k|=100\), \(M=5\), and \(p_k=0.6\), Eq. (9) keeps weight 1 for the \(N_m=20\) most reliable samples, exponentially weakens the weights of those ranked 21 to \(N_s=60\), and zeros out the rest. As a result, a small fluctuation of \(N_s\) will not cause considerable changes in the weight distribution, making the method more robust to possible estimation errors in the label proportions. The robustness of our method to such estimation errors is evaluated experimentally in Section 5.4.
By combining label agreement [17] and the similarity between the predicted and true label proportions, our reweighting strategy not only weakens the role of unreliable samples in the training set T but also preserves the important role of a sufficient number of reliable samples within the majorclass label proportion constraint of each cell.
The updated \(\omega\) is used to retrain the SVM classifier. After a number of iterations, strong weights are assigned to the reliable samples and weak weights to the others, producing an optimized pixel-level training set (in the form of optimized sample weights). Finally, the classifier predicts the label of each sample in the test data using the optimized model parameters.
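As a toy illustration of how the sketches above fit together (synthetic data and hypothetical names; not an experiment from this paper), two grid cells are built, each dominated by one class, and the `lpcsvm` skeleton is run for a few iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:100, 0] -= 1.5                                   # cell 0: dominated by class 0
X[100:, 0] += 1.5                                   # cell 1: dominated by class 1
y_true = (X[:, 0] > 0).astype(int)                  # hidden pixel-level truth

cells = {0: np.arange(0, 100), 1: np.arange(100, 200)}
y_grid = np.empty(200, dtype=int)                   # grid labels y_i = l^(k)
proportions = {}
for k, idx in cells.items():
    major = np.bincount(y_true[idx]).argmax()       # majorclass label of the cell
    y_grid[idx] = major
    proportions[k] = float(np.mean(y_true[idx] == major))  # majorclass proportion p_k

clf, omega = lpcsvm(X, y_grid, cells, proportions, n_classes=2, n_iter=5)
print((clf.predict(X) == y_true).mean())            # pixel-level accuracy of the sketch
```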
Although partially inspired by the idea of ∝SVM [11], our model is essentially different from ∝SVM in the following respects: (1) ∝SVM works with multiclass label proportions, while our model works with the majorclass label proportion, which makes sample labeling more efficient; (2) ∝SVM only deals with binary classification, while our model deals directly with multi-class classification; and (3) our model is optimized by sample reweighting, which is entirely different from ∝SVM and also makes multiclass classification realizable.