This section presents a method for error degree estimation in numerical weather prediction using MKDA-based ordinal regression. In this paper, we define prediction error degrees as several ordered discrete ranks. Specifically, we divide the axis of the prediction error into several intervals and assign a symbol, which corresponds to the prediction error degree, to each interval as shown in the example in Figure 1. As shown in Figure 2, the proposed method tries to estimate the unknown prediction error degree that will occur in the forecast in each area from known observed meteorological data. Specifically, from observed data obtained several hours earlier, the prediction error degree is estimated for each area by MKDA-based ordinal regression. In order to train MKDA, we have to provide pairs of features extracted from the observed meteorological data and known prediction error degrees. These training pairs are therefore prepared from past observation times, e.g., a few days before the target day for which prediction error degrees are estimated.
Section 2.1 presents details of feature extraction from meteorological data, and Section 2.2 presents an algorithm for estimating the prediction error degree based on ordinal regression using MKDA.
2.1 Feature extraction from meteorological data
This subsection presents details of feature extraction from meteorological data. The proposed method tries to estimate the prediction error degree in each area by using features calculated from 'previously observed errors of a target meteorological element and some other related elements' and 'their time variations' in the same area and its neighboring areas. Among the neighboring areas, only those whose atmospheres move toward the target area affect the estimation of its prediction error degree.
Suppose that prediction error degree estimation of a target meteorological element F_{0} is performed in area p at time t. Furthermore, it is assumed that the prediction errors of several related meteorological elements including F_{0}, i.e., F_{l} (l=0,1,⋯,L; L+1 being the number of meteorological elements used for calculating features), at time t−sΔt (s=0,1,⋯,S) are known, where the index s refers to the current or past time steps. The proposed method calculates the time average, maximum, and minimum of the prediction errors e_{l}(p,t), denoted by {x}_{l}^{\text{ave}}(\mathbf{p},t), {x}_{l}^{\text{max}}(\mathbf{p},t), and {x}_{l}^{\text{min}}(\mathbf{p},t), respectively, and the average {x}_{l}^{\text{tv}}(\mathbf{p},t) of their time variations between time t−(s+1)Δt and t−sΔt for each meteorological element F_{l} in each area p as feature values. In this section, e_{l}(p,t) represents the prediction error of meteorological element F_{l} in area p at time t. In detail, {x}_{l}^{\text{ave}}(\mathbf{p},t), {x}_{l}^{\text{max}}(\mathbf{p},t), {x}_{l}^{\text{min}}(\mathbf{p},t), and {x}_{l}^{\text{tv}}(\mathbf{p},t) are obtained as follows:
\begin{array}{l}{x}_{l}^{\text{ave}}(\mathbf{p},t)=\frac{1}{S+1}\sum _{s=0}^{S}{e}_{l}(\mathbf{p},t-s\Delta t),\end{array}
(1)
\begin{array}{l}{x}_{l}^{\text{max}}(\mathbf{p},t)=\underset{s=0,1,\cdots ,S}{max}\phantom{\rule{0.3em}{0ex}}{e}_{l}(\mathbf{p},t-s\Delta t),\end{array}
(2)
\begin{array}{l}{x}_{l}^{\text{min}}(\mathbf{p},t)=\underset{s=0,1,\cdots ,S}{min}\phantom{\rule{0.3em}{0ex}}{e}_{l}(\mathbf{p},t-s\Delta t),\end{array}
(3)
\begin{array}{l}{x}_{l}^{\text{tv}}(\mathbf{p},t)=\frac{1}{S}\sum _{s=0}^{S-1}{x}_{l,s}^{\text{tv}}(\mathbf{p},t),\end{array}
(4)
where
\begin{array}{l}{x}_{l,s}^{\text{tv}}(\mathbf{p},t)={e}_{l}(\mathbf{p},t-s\Delta t)-{e}_{l}(\mathbf{p},t-(s+1)\Delta t).\end{array}
(5)
For calculating the four features {x}_{l}^{\text{ave}}(\mathbf{p},t), {x}_{l}^{\text{max}}(\mathbf{p},t), {x}_{l}^{\text{min}}(\mathbf{p},t), and {x}_{l}^{\text{tv}}(\mathbf{p},t) of the target area p for meteorological element F_{l} at time t, the errors e_{l}(p,t), e_{l}(p,t−Δt), e_{l}(p,t−2Δt), ⋯, e_{l}(p,t−SΔt) are used. This means that errors in the same area p at the current and past times are used for the calculation. Note that in Equations 1 to 5, the index l refers to the l-th meteorological element F_{l}.
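As a concrete illustration of Equations 1 to 5, the four temporal features for one area and one meteorological element can be sketched as follows. This is a minimal sketch: the function name and the newest-first array layout are our assumptions, not the paper's.

```python
import numpy as np

def temporal_features(e, S):
    """Compute the four temporal features of Equations 1 to 5 for one
    area and one meteorological element.

    e : 1-D array of prediction errors ordered from the current time
        backwards, so e[s] = e_l(p, t - s*dt) for s = 0, ..., S.
    """
    e = np.asarray(e[:S + 1], dtype=float)
    x_ave = e.mean()                       # Equation 1
    x_max = e.max()                        # Equation 2
    x_min = e.min()                        # Equation 3
    # Equation 5: x_tv_s = e(t - s*dt) - e(t - (s+1)*dt), s = 0, ..., S-1
    tv = e[:-1] - e[1:]
    x_tv = tv.mean()                       # Equation 4
    return x_ave, x_max, x_min, x_tv
```

With errors [3, 1, 2] (newest first) and S = 2, the time variations are [2, −1], so the averaged variation is 0.5.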
Furthermore, in order to use the prediction errors that are propagated from neighboring areas to the target area by atmospheric movements, motion vectors representing atmospheric movements between time t−Δt and time t are obtained from the observed wind velocities for each area. Then, the feature {x}_{l}^{\text{neighbor}}(\mathbf{p},t) of the prediction error for meteorological element F_{l} propagated from the neighboring areas to the target area p is obtained by the following equation:
\begin{array}{l}{x}_{l}^{\text{neighbor}}(\mathbf{p},t)={\text{average}}_{{\mathbf{p}}^{\ast}\in {R}_{(\mathbf{p},t)}}\left[{e}_{l}({\mathbf{p}}^{\ast},t-\Delta t)\right],\end{array}
(6)
where average[·] is an operator calculating the average value. Furthermore, R_{(p,t)} represents the set of areas whose atmospheres move to the target area p from time t−Δt to time t. Specifically, by denoting the atmospheric movement of area p^{∗} from time t−Δt to time t as v(p^{∗},t−Δt), R_{(p,t)} can be represented as follows:
\begin{array}{l}{R}_{(\mathbf{p},t)}=\left\{{\mathbf{p}}^{\ast}\mid {\mathbf{p}}^{\ast}+\mathbf{v}\left({\mathbf{p}}^{\ast},t-\Delta t\right)\approx \mathbf{p}\right\}.\end{array}
(7)
By using the atmospheric movements, we can select the areas whose atmospheres move to the target area p from time t−Δt to time t. As shown in Equation 7, we define the neighbors of p as the set of areas p^{∗} satisfying p^{∗}+v(p^{∗},t−Δt)≈p, i.e., ∥p^{∗}+v(p^{∗},t−Δt)−p∥_{c}<ε for a small positive constant ε, where ∥·∥_{c} represents the chessboard distance. In this study, we set ε to 5 km because the distance between the most neighboring areas is 5 km in the dataset used in the experiments; if that distance changes, ε should be changed accordingly. Furthermore, it is well known that the performance of numerical weather prediction improves as the distance between the most neighboring areas becomes smaller, and the performance of prediction error degree estimation can be expected to improve in the same way. As shown in the example in Figure 3, the atmospheres of six areas, i.e., the yellow areas, move to the target area p, and these six areas are therefore regarded as neighbors. Then, the prediction errors e_{l}(p^{∗},t−Δt) of these six areas p^{∗} in the previous time step t−Δt are averaged, and the feature {x}_{l}^{\text{neighbor}}(\mathbf{p},t) is obtained. If no area moves into the target area p, {x}_{l}^{\text{neighbor}}(\mathbf{p},t) is set to zero in our method. By using the feature in Equation 6, the influence of prediction errors propagated to the target area p can be considered.
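Equations 6 and 7 can be sketched as follows, assuming each area is represented by its coordinates and all arrays are indexed consistently; the names and the array layout are illustrative, not from the paper.

```python
import numpy as np

def neighbor_feature(p, errors_prev, positions, wind, eps=5.0):
    """Equations 6 and 7: average the previous-step errors of the areas
    whose atmosphere moves into the target area p.

    p           : (2,) coordinates of the target area.
    errors_prev : (n,) errors e_l(p*, t - dt) for every area p*.
    positions   : (n, 2) coordinates of every area p*.
    wind        : (n, 2) motion vectors v(p*, t - dt).
    eps         : tolerance of the chessboard (Chebyshev) distance.
    """
    arrived = positions + wind                      # p* + v(p*, t - dt)
    # chessboard distance ||.||_c = maximum coordinate-wise distance
    d = np.abs(arrived - np.asarray(p)).max(axis=1)
    mask = d < eps                                  # the set R_(p,t)
    if not mask.any():                              # no inflow: feature is 0
        return 0.0
    return float(errors_prev[mask].mean())
```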
From the features {x}_{l}^{\text{ave}}(\mathbf{p},t), {x}_{l}^{\text{max}}(\mathbf{p},t), {x}_{l}^{\text{min}}(\mathbf{p},t), {x}_{l}^{\text{tv}}(\mathbf{p},t), and {x}_{l}^{\text{neighbor}}(\mathbf{p},t) (l=0,1,⋯,L), we can define a feature vector for each area p at time t. Note that these five features in area p at time t for meteorological element F_{l} can be calculated on several isometric surfaces in the proposed method. Therefore, given J isometric surfaces, 5J features are obtained in area p at time t for each meteorological element F_{l}. Then, for each area p at time t, a total of d(=5J×(L+1)) features is obtained, i.e., a d-dimensional feature vector is finally obtained.
2.2 Algorithm for estimation of prediction error degree
This subsection presents an algorithm for estimating prediction error degrees using MKDA-based ordinal regression. The proposed method estimates class labels, i.e., prediction error degrees in each area p, from the features obtained as described in the previous subsection.
First, we denote a set of training data (x_{i},y_{i})∈R^{d}×R (i=1,2,⋯,M; M being the number of training samples) as {\mathcal{T}}_{M}. Each x_{i}∈R^{d} is a d-dimensional (d being the number of features shown in the previous subsection) input feature vector, and y_{i}∈{1,2,⋯,K} is the corresponding ordered class label, where K is the number of classes. This label is obtained by quantizing the known prediction error of the target meteorological element F_{0} into K ranks as shown in the example in Figure 1, where K is set to seven. Furthermore, the proposed method maps x_{i} into the feature space via a nonlinear map [26], and \varphi \left({\mathbf{x}}_{i}\right)\in \mathcal{F} is obtained. In the proposed method, ϕ(·) is a nonlinear map which projects an input vector into the high-dimensional feature space \mathcal{F}. Note that the dimension of \mathcal{F} depends on the definition of the kernel function corresponding to the nonlinear map ϕ(·). Since ϕ(x_{i}) is high-dimensional or infinite-dimensional, it may not be possible to calculate it directly. Fortunately, it is well known that the following computational procedures depend only on the inner products in the feature space, which can be obtained from a suitable kernel function [26]. Given two vectors x_{m} and x_{n}∈R^{d}, our method uses the following multiple kernel function:
\begin{array}{l}\varphi {\left({\mathbf{x}}_{m}\right)}^{\prime}\varphi \left({\mathbf{x}}_{n}\right)=\sum _{l=0}^{L}{\gamma}_{l}{\kappa}_{l}\left({\mathbf{E}}_{l}{\mathbf{x}}_{m},{\mathbf{E}}_{l}{\mathbf{x}}_{n}\right),\end{array}
(8)
where γ_{l} represents the weight of the l-th kernel κ_{l}(·,·) and satisfies γ_{l}≥0 and \sum _{l=0}^{L}{\gamma}_{l}=1. The superscript ^{′} denotes the vector/matrix transpose in this paper. Furthermore, E_{l} in Equation 8 is a diagonal matrix whose diagonal elements are zero or one, and it extracts the features of meteorological element F_{l} that are used for the l-th kernel κ_{l}(·,·). Specifically, the dimension of the vectors x_{m} and x_{n} is d(=5J×(L+1)) as defined in the previous subsection, and each extraction matrix E_{l} extracts the 5J features of meteorological element F_{l}. In the proposed method, the multiple kernel scheme is applied across the different meteorological elements: since we use L+1 kinds of meteorological elements, L+1 kernels are linearly combined in Equation 8. As shown in the previous subsection, various kinds of meteorological elements can be used for estimating the prediction error degree; thus, the proposed method extends the kernel function to the multiple kernel version in Equation 8. Then, by successfully determining the parameters γ_{l} (l=0,1,⋯,L) of the multiple kernel function, the features can be mapped into the optimal feature space, enabling accurate ordinal regression. It is important to determine the parameters γ_{l} successfully, and the details of their determination are shown in Section 2.2.3. Note that κ_{l}(·,·) of each meteorological element F_{l} is set to the well-known Gaussian kernel in our method.
2.2.1 Sampling of training data
Note that when the kernel method is adopted, the computation depends on the inner products between all pairs of training samples in the feature space, and direct use of MKDA becomes difficult when the amount of training data is large. Therefore, the proposed method uses a sampling scheme. Specifically, we regard the error data e_{l}(p,t) at time t as two-dimensional signals and perform their downsampling.
Then, the new sampled training data (x_{i},y_{i}) (i=1,2,⋯,N; N<M) are obtained, where (x_{i},y_{i}) is redefined, and N is the number of new training samples. In the following explanation of the training of MKDA, we use these training data (x_{i},y_{i}) (i=1,2,⋯,N). Also, we denote the set of these sampled training data as {\mathcal{T}}_{N}.
Reducing the number of training samples could degrade the performance of KDA-based ordinal regression. Note, however, that the proposed method regards the error data e_{l}(p,t) as two-dimensional signals and performs their downsampling. Since neighboring areas in meteorological data generally tend to have similar features, the distribution of the training data is not drastically changed by the sampling, and performance degradation tends to be avoided. Furthermore, in the proposed method, the remaining training data in {\mathcal{T}}_{M}\setminus {\mathcal{T}}_{N}, which are removed by the sampling, can be used for estimating γ_{l} (l=0,1,⋯,L) in Equation 8. By using these remaining data, we can improve the performance of the error degree estimation based on the multiple kernel scheme. The details are shown in Section 2.2.3.
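The sampling scheme can be sketched as follows, assuming the training pairs are arranged on an H×W area grid; the sampling factor `step` is our illustrative parameter, as the paper does not specify one.

```python
import numpy as np

def split_by_downsampling(grid_features, grid_labels, step=2):
    """Regard the per-area training pairs as 2-D signals and keep every
    `step`-th area in both directions: the kept pairs form T_N and the
    removed ones form T_M \\ T_N (used later to tune the kernel weights).

    grid_features : (H, W, d) feature vectors on the area grid.
    grid_labels   : (H, W) prediction error degrees.
    """
    H, W, d = grid_features.shape
    keep = np.zeros((H, W), dtype=bool)
    keep[::step, ::step] = True
    T_N = (grid_features[keep], grid_labels[keep])
    T_rest = (grid_features[~keep], grid_labels[~keep])
    return T_N, T_rest
```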
2.2.2 Derivation of MKDA
The objective of discriminant analysis is to find a projection w by which different classes are well separated. Specifically, we first define the within-class and between-class scatter matrices as follows:
\begin{array}{ll}{\mathbf{S}}_{w}& =\frac{1}{N}\sum _{k=1}^{K}\sum _{j=1}^{{N}^{k}}\left\{\varphi \left({\mathbf{x}}_{j}^{k}\right)-{\mathbf{m}}^{k}\right\}{\left\{\varphi \left({\mathbf{x}}_{j}^{k}\right)-{\mathbf{m}}^{k}\right\}}^{\prime}\\ & =\frac{1}{N}\sum _{k=1}^{K}{\mathit{\Xi}}^{k}{\mathbf{H}}^{k}{{\mathit{\Xi}}^{k}}^{\prime},\end{array}
(9)
\begin{array}{ll}{\mathbf{S}}_{b}& =\frac{1}{N}\sum _{k=1}^{K}{N}^{k}\left({\mathbf{m}}^{k}-\mathbf{m}\right){\left({\mathbf{m}}^{k}-\mathbf{m}\right)}^{\prime},\end{array}
(10)
where
\begin{array}{ll}{\mathbf{m}}^{k}& =\frac{1}{{N}^{k}}\sum _{j=1}^{{N}^{k}}\varphi \left({\mathbf{x}}_{j}^{k}\right)\phantom{\rule{2em}{0ex}}\\ =\frac{1}{{N}^{k}}{\mathit{\Xi}}^{k}{\mathbf{1}}^{k}.\phantom{\rule{2em}{0ex}}\end{array}
(11)
In the above equations, {\mathbf{x}}_{j}^{k} (j=1,2,⋯,N^{k}) corresponds to the x_{i} (i=1,2,⋯,N) belonging to the k-th class (i.e., y_{i}=k), and m^{k} denotes the mean vector of the \varphi \left({\mathbf{x}}_{j}^{k}\right) belonging to the k-th class as shown in Equation 11. Furthermore, {\mathit{\Xi}}^{k}=\left[\varphi \left({\mathbf{x}}_{1}^{k}\right),\varphi \left({\mathbf{x}}_{2}^{k}\right),\cdots ,\varphi \left({\mathbf{x}}_{{N}^{k}}^{k}\right)\right], and {\mathbf{H}}^{k}={\mathbf{I}}^{k}-\frac{1}{{N}^{k}}{\mathbf{1}}^{k}{{\mathbf{1}}^{k}}^{\prime} is a centering matrix satisfying {{\mathbf{H}}^{k}}^{\prime}={\mathbf{H}}^{k} and ({\mathbf{H}}^{k})^{2}={\mathbf{H}}^{k}, where I^{k} is the N^{k}×N^{k} identity matrix, and 1^{k}=[1,1,⋯,1]^{′} is an N^{k}×1 vector. The vector \mathbf{m}=\frac{1}{N}\sum _{i=1}^{N}\varphi \left({\mathbf{x}}_{i}\right) in Equation 10 is the mean vector of all samples x_{i} (i=1,2,⋯,N). Generally, the objective of discriminant analysis can be achieved by the following minimization:
\begin{array}{l}minJ\left(\mathbf{w}\right)=\frac{{\mathbf{w}}^{\prime}{\mathbf{S}}_{w}\mathbf{w}}{{\mathbf{w}}^{\prime}{\mathbf{S}}_{b}\mathbf{w}}.\end{array}
(12)
In the proposed method, we perform ordinal regression with K ordered classes. Therefore, the goal of our method is to find the optimal projection w satisfying the following two points:

1.
The projection w should minimize the withinclass distance and maximize the betweenclass distance simultaneously.

2.
The projection w should preserve the ordinal information of the different classes, i.e., the average projection of samples from higher-rank classes should be larger than that of lower-rank classes.
Therefore, the formulation for the MKDA-based ordinal regression is derived from [13] as follows:
\begin{array}{ll}minJ(\mathbf{w},\rho )& ={\mathbf{w}}^{\prime}{\mathbf{S}}_{w}\mathbf{w}-C\rho \phantom{\rule{1em}{0ex}}\text{s.t.}\phantom{\rule{1em}{0ex}}{\mathbf{w}}^{\prime}\left({\mathbf{m}}^{k+1}-{\mathbf{m}}^{k}\right)\\ & \ge \rho ,\phantom{\rule{1em}{0ex}}k=1,2,\cdots ,K-1,\end{array}
(13)
where C>0 is a penalty coefficient. The above equation minimizes the within-class distances and, instead of using the between-class scatter matrix directly, maximizes the distance between the projected means of the closest pairs of classes. Specifically, whereas Equation 12 minimizes the within-class distance and maximizes the between-class distance simultaneously, Equation 13 reformulates this problem: the within-class distance is minimized through w^{′}S_{w}w in J(w,ρ), and instead of directly maximizing the between-class distance, the new constraint w^{′}(m^{k+1}−m^{k})≥ρ (k=1,2,⋯,K−1) is introduced. In this way, our MKDA-based ordinal regression estimates the projection w minimizing the within-class distance under the constraint of ordinal information, and the distribution direction can be considered through the within-class scatter matrix S_{w} in Equation 13.
In MKDA, the projection w is high-dimensional or infinite-dimensional and cannot be calculated directly. Thus, the projection w is written as follows:
\begin{array}{ll}\mathbf{w}& =\sum _{i=1}^{N}{\beta}_{i}\varphi \left({\mathbf{x}}_{i}\right)\\ =\mathit{\Xi}\mathit{\beta},\end{array}
(14)
where β_{i} (i=1,2,⋯,N) is a linear coefficient, β=[β_{1},β_{2},⋯,β_{N}]^{′}, and Ξ=[ϕ(x_{1}),ϕ(x_{2}),⋯,ϕ(x_{N})]. In the proposed method, we derived Equation 14 by using the method in [13]. In several kernel methods such as KDA and KPCA, the projection is represented by a linear combination of the samples, and we adopt the same representation here. However, the derivation of the representer theorem in the multiple kernel case is not straightforward, so the proposed method uses the representation of Equation 14 as an approximation; its theoretical analysis is left for future work.
By using Equations 9, 11, and 14, J(w,ρ) and w^{′}(m^{k+1}−m^{k}) in Equation 13 are rewritten as follows:
\begin{array}{rcl}J(\mathbf{w},\rho )& =& {\left(\mathit{\Xi}\mathit{\beta}\right)}^{\prime}\left(\frac{1}{N}\sum _{k=1}^{K}{\mathit{\Xi}}^{k}{\mathbf{H}}^{k}{{\mathit{\Xi}}^{k}}^{\prime}\right)\mathit{\Xi}\mathit{\beta}-C\rho \\ & =& {\mathit{\beta}}^{\prime}\left\{\frac{1}{N}\sum _{k=1}^{K}\left({\mathit{\Xi}}^{\prime}{\mathit{\Xi}}^{k}\right){\mathbf{H}}^{k}\left({{\mathit{\Xi}}^{k}}^{\prime}\mathit{\Xi}\right)\right\}\mathit{\beta}-C\rho \\ & =& {\mathit{\beta}}^{\prime}\left\{\frac{1}{N}\sum _{k=1}^{K}{\mathbf{\text{G}}}^{k}{\mathbf{H}}^{k}{{\mathbf{\text{G}}}^{k}}^{\prime}\right\}\mathit{\beta}-C\rho \\ & =& {\mathit{\beta}}^{\prime}\mathbf{\text{T}}\mathit{\beta}-C\rho ,\end{array}
(15)
\begin{array}{rcl}{\mathbf{w}}^{\prime}\left({\mathbf{m}}^{k+1}-{\mathbf{m}}^{k}\right)& =& {\left(\mathit{\Xi}\mathit{\beta}\right)}^{\prime}\left(\frac{1}{{N}^{k+1}}{\mathit{\Xi}}^{k+1}{\mathbf{1}}^{k+1}-\frac{1}{{N}^{k}}{\mathit{\Xi}}^{k}{\mathbf{1}}^{k}\right)\\ & =& {\mathit{\beta}}^{\prime}\left(\frac{1}{{N}^{k+1}}{\mathit{\Xi}}^{\prime}{\mathit{\Xi}}^{k+1}{\mathbf{1}}^{k+1}-\frac{1}{{N}^{k}}{\mathit{\Xi}}^{\prime}{\mathit{\Xi}}^{k}{\mathbf{1}}^{k}\right)\\ & =& {\mathit{\beta}}^{\prime}\left(\frac{1}{{N}^{k+1}}{\mathbf{\text{G}}}^{k+1}{\mathbf{1}}^{k+1}-\frac{1}{{N}^{k}}{\mathbf{\text{G}}}^{k}{\mathbf{1}}^{k}\right)\\ & =& {\mathit{\beta}}^{\prime}\left({\mathbf{\text{r}}}^{k+1}-{\mathbf{\text{r}}}^{k}\right),\end{array}
(16)
where
\begin{array}{ll}{\mathbf{\text{G}}}^{k}& ={\mathit{\Xi}}^{\prime}{\mathit{\Xi}}^{k},\phantom{\rule{2em}{0ex}}\end{array}
(17)
\begin{array}{ll}\mathbf{\text{T}}& =\frac{1}{N}\sum _{k=1}^{K}{\mathbf{\text{G}}}^{k}{\mathbf{H}}^{k}{{\mathbf{\text{G}}}^{k}}^{\prime},\phantom{\rule{2em}{0ex}}\end{array}
(18)
\begin{array}{ll}{\mathbf{\text{r}}}^{k}& =\frac{1}{{N}^{k}}{\mathbf{\text{G}}}^{k}{\mathbf{1}}^{k}.\phantom{\rule{2em}{0ex}}\end{array}
(19)
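Equations 17 to 19 depend on the data only through inner products, so they can be computed directly from the N×N Gram matrix Ξ'Ξ. A sketch (the function and variable names are ours):

```python
import numpy as np

def kernel_statistics(K_gram, labels, num_classes):
    """Equations 17 to 19: build G^k, T, and r^k from the N x N Gram
    matrix K_gram = Xi' Xi and the class labels y_i in {1, ..., K}.
    """
    N = K_gram.shape[0]
    T = np.zeros((N, N))
    r = []
    for k in range(1, num_classes + 1):
        idx = np.flatnonzero(labels == k)            # samples of class k
        Nk = len(idx)
        G_k = K_gram[:, idx]                         # Equation 17
        H_k = np.eye(Nk) - np.ones((Nk, Nk)) / Nk    # centering matrix H^k
        T += G_k @ H_k @ G_k.T / N                   # Equation 18
        r.append(G_k @ np.ones(Nk) / Nk)             # Equation 19
    return T, r
```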
The problem of w in Equation 13 can be rewritten as that of β as follows:
\begin{array}{ll}minJ(\mathit{\beta},\rho )& ={\mathit{\beta}}^{\prime}\mathbf{\text{T}}\mathit{\beta}-C\rho \phantom{\rule{1em}{0ex}}\text{s.t.}\phantom{\rule{1em}{0ex}}{\mathit{\beta}}^{\prime}\left({\mathbf{\text{r}}}^{k+1}-{\mathbf{\text{r}}}^{k}\right)\\ & \ge \rho ,\phantom{\rule{1em}{0ex}}k=1,2,\cdots ,K-1.\end{array}
(20)
In order to solve Equation 20, we define the following Lagrangian equation:
\begin{array}{ll}\mathcal{L}(\mathit{\beta},\rho ,\mathit{\alpha})& ={\mathit{\beta}}^{\prime}\mathbf{\text{T}}\mathit{\beta}-C\rho \\ & \phantom{\rule{1em}{0ex}}-\sum _{k=1}^{K-1}{\alpha}^{k}\left\{{\mathit{\beta}}^{\prime}\left({\mathbf{\text{r}}}^{k+1}-{\mathbf{\text{r}}}^{k}\right)-\rho \right\},\end{array}
(21)
where α=[α^{1},α^{2},⋯,α^{K−1}]^{′} represents a vector containing Lagrange multipliers. Then, by calculating \frac{\partial \mathcal{\mathcal{L}}}{\partial \mathit{\beta}}=0 and \frac{\partial \mathcal{\mathcal{L}}}{\mathrm{\partial \rho}}=0, the following equations are respectively obtained:
\begin{array}{ll}\mathit{\beta}& =\frac{1}{2}{\mathbf{\text{T}}}^{-1}\sum _{k=1}^{K-1}{\alpha}^{k}\left({\mathbf{\text{r}}}^{k+1}-{\mathbf{\text{r}}}^{k}\right),\end{array}
(22)
\begin{array}{ll}\sum _{k=1}^{K-1}{\alpha}^{k}& =C.\end{array}
(23)
From the above equations, the optimization problem in Equation 20 is turned into
\begin{array}{ll}\underset{\mathit{\alpha}}{min}& {\left\{\sum _{k=1}^{K-1}{\alpha}^{k}\left({\mathbf{\text{r}}}^{k+1}-{\mathbf{\text{r}}}^{k}\right)\right\}}^{\prime}{\mathbf{\text{T}}}^{-1}\left\{\sum _{k=1}^{K-1}{\alpha}^{k}\left({\mathbf{\text{r}}}^{k+1}-{\mathbf{\text{r}}}^{k}\right)\right\}\\ & \phantom{\rule{1em}{0ex}}\text{s.t.}\phantom{\rule{1em}{0ex}}{\alpha}^{k}\ge 0,\phantom{\rule{1em}{0ex}}k=1,2,\cdots ,K-1,\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}\sum _{k=1}^{K-1}{\alpha}^{k}=C.\end{array}
(24)
The proposed method estimates the optimal result of α by using the penalty method [27].
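The exact penalty scheme of [27] is not reproduced here; the following is only an illustrative stand-in, a simple quadratic-penalty gradient descent for the constrained problem in Equation 24, with names and step sizes of our choosing.

```python
import numpy as np

def solve_alpha(T, r, C=1.0, mu=100.0, lr=1e-3, iters=5000):
    """Approximately minimize Equation 24 with a quadratic-penalty
    gradient descent (an illustrative stand-in for the penalty method
    of [27]).

    r : list of the K vectors r^1, ..., r^K from Equation 19.
    """
    # columns of D are the difference vectors r^{k+1} - r^k
    D = np.stack([r[k + 1] - r[k] for k in range(len(r) - 1)], axis=1)
    Q = D.T @ np.linalg.inv(T) @ D          # objective is alpha' Q alpha
    alpha = np.full(D.shape[1], C / D.shape[1])
    for _ in range(iters):
        grad = 2.0 * Q @ alpha
        grad += 2.0 * mu * (alpha.sum() - C)         # sum-to-C penalty
        grad += 2.0 * mu * np.minimum(alpha, 0.0)    # alpha >= 0 penalty
        alpha -= lr * grad
    return alpha
```

With K = 2 the problem has a single multiplier, and the penalty formulation drives it close to C.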
2.2.3 MKDAbased ordinal regression and determination of kernels’ contributions
As shown in the above explanation, the optimal projection w can be obtained from the β computed via α.
From the obtained optimal projection w, the rank of an unseen input feature vector x, i.e., the prediction error degree at each area p, can be estimated by the following decision rule:
\begin{array}{l}f\left(\mathbf{x}\right)=\underset{k\in \left\{1,2,\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{1.5pt}{0ex}}K\right\}}{min}\left\{k:{\mathbf{w}}^{\prime}\varphi \left(\mathbf{x}\right){b}_{k}<0\right\},\end{array}
(25)
where b_{k} is defined as
\begin{array}{l}{b}_{k}=\frac{{\mathit{\beta}}^{\prime}\left({\mathbf{\text{r}}}^{k+1}+{\mathbf{\text{r}}}^{k}\right)}{2}.\end{array}
(26)
Then, from Equations 14 and 26, Equation 25 is rewritten as follows:
\begin{array}{l}\phantom{\rule{10.0pt}{0ex}}f\left(\mathbf{x}\right)=\underset{k\in \left\{1,2,\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{1.5pt}{0ex}}K\right\}}{min}\left\{k:{\mathit{\beta}}^{\prime}\left({\mathit{\Xi}}^{\prime}\varphi \left(\mathbf{x}\right)\frac{{\mathbf{\text{r}}}^{k+1}+{\mathbf{\text{r}}}^{k}}{2}\right)<0\right\}.\end{array}
(27)
The above equation enables ordinal regression for estimating prediction error degrees.
Note that since our method adopts a multiple kernel algorithm, we also have to determine γ=[γ_{0},γ_{1},⋯,γ_{L}]^{′} in Equation 8. Some methods such as SimpleMKL [21] have been proposed for determining γ. However, if SimpleMKL is used for estimating the parameters of the linear combination of kernels, successful performance of the error degree estimation cannot be obtained. We conjecture that SimpleMKL fits γ so that w becomes optimal for Equation 13 on the sampled training data alone, so the generalization ability deteriorates and a phenomenon similar to overfitting occurs; that is, the fitting of γ tends to depend strongly on the sampled training data. Furthermore, it has been reported that the MKL approach does not always outperform a single kernel-based approach [19]. Thus, it is important to determine γ in such a way that the performance is also guaranteed for new samples, in order to keep the method robust.
Fortunately, we still have the remaining training data (x_{i},y_{i}) included in {\mathcal{T}}_{M}\setminus {\mathcal{T}}_{N}, which were removed by the sampling scheme. Therefore, we use them to verify the estimation performance of the error degrees, and the best γ is determined by an exhaustive search. Generally, the statistical characteristics of the test data tend to differ slightly from those of the training data, and using the remaining training data in {\mathcal{T}}_{M}\setminus {\mathcal{T}}_{N} makes the proposed method robust to this difference. Specifically, the proposed method varies the values of γ_{l} (l=0,1,2,⋯,L) over 0.1,0.2,⋯,1.0 under the constraints γ_{l}≥0 (l=0,1,2,⋯,L) and \sum _{l=0}^{L}{\gamma}_{l}=1.0. Then, the problem in Equation 24 is optimized with respect to α. Furthermore, for the data (x_{i},y_{i}) included in {\mathcal{T}}_{M}\setminus {\mathcal{T}}_{N}, i.e., the training data remaining after the sampling, we perform the estimation of prediction error degrees and calculate the mean absolute error.
Specifically, the errors between the estimated error degree f(x_{i}) in Equation 27 and the true rank y_{i} are calculated. Finally, we output the optimal γ that provides the most accurate estimation results, i.e., the minimum mean absolute error. By using this exhaustive search procedure, the proposed method determines the multiple kernel function. Note that, in the same way as for γ, the proposed method estimates the parameter of the Gaussian kernel for each meteorological element, where the search interval is set to 0.5.
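The exhaustive search over γ can be sketched as follows; `train_fn` and `evaluate_fn` are hypothetical stand-ins for fitting MKDA on T_N and scoring the resulting predictor on T_M \ T_N, since those steps are defined elsewhere in the pipeline.

```python
import itertools
import numpy as np

def search_gamma(candidates, train_fn, evaluate_fn, L):
    """Exhaustive search for the kernel weights gamma_0, ..., gamma_L:
    every combination on the candidate grid that sums to 1 is trained
    and scored by its mean absolute error on the held-out pairs.

    train_fn(gamma) fits MKDA on T_N and returns a predictor;
    evaluate_fn(predictor) returns its mean absolute error on T_M \\ T_N.
    """
    best_gamma, best_mae = None, np.inf
    for gamma in itertools.product(candidates, repeat=L + 1):
        if abs(sum(gamma) - 1.0) > 1e-9:        # keep only simplex points
            continue
        mae = evaluate_fn(train_fn(gamma))
        if mae < best_mae:
            best_gamma, best_mae = gamma, mae
    return best_gamma, best_mae
```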
In this way, we can perform prediction error degree estimation by using ordinal regression based on MKDA. In numerical weather prediction, various observed inputs, i.e., various input feature vectors, are obtained. Since discriminant analysis can consider the global information of the data through the distribution of the classes, the proposed method adopts it for the estimation. Furthermore, the proposed method adopts sampling of the training data and effectively uses the remaining data for estimating the optimal parameters of the multiple kernel scheme. This is the biggest difference between the proposed method and the conventional KDA-based method.