Self-organizing kernel adaptive filtering
- Songlin Zhao^{1},
- Badong Chen^{2} (corresponding author),
- Zheng Cao^{1},
- Pingping Zhu^{1} and
- Jose C. Principe^{1}
DOI: 10.1186/s13634-016-0406-3
© The Author(s) 2016
Received: 20 June 2016
Accepted: 28 September 2016
Published: 12 October 2016
Abstract
This paper presents a model-selection strategy based on minimum description length (MDL) that keeps the kernel least-mean-square (KLMS) model tuned to the complexity of the input data. The proposed KLMS-MDL filter adapts its model order as well as its coefficients online, behaving as a self-organizing system and achieving a good compromise between system accuracy and computational complexity without a priori knowledge. Particularly, in a nonstationary scenario, the model order of the proposed algorithm changes continuously with the input data structure. Experiments show the proposed algorithm successfully builds compact kernel adaptive filters with better accuracy than KLMS with sparsity or fixed-budget algorithms.
Keywords
Kernel method; Model selection; Sparsification; Minimal description length
1 Introduction
Owing to their universal modeling capability and convex cost, kernel adaptive filters (KAFs) are attracting renewed attention. Even though these methods achieve powerful classification and regression performance, their model order (system structure) and computational complexity grow with the number of processed data. For example, the model order of kernel least-mean-square (KLMS) [1] and kernel recursive-least-square (KRLS) [2] scales as O(n), where n is the number of processed data, and their computational complexities are O(n) and O(n ^{2}) at each iteration, respectively. This characteristic hinders widespread use of KAFs unless the filter growth is constrained. To curb their growth to sublinear rates, not all samples are included in the dictionary, and a number of different criteria for online sparsification are adopted: the novelty criterion [3, 4], the approximate linear dependency (ALD) criterion [2, 5], coherence [6], the surprise criterion [7], and quantization [8, 9]. Alternatively, pruning criteria discard redundant centers from an existing large dictionary [10–17].
Even though these techniques obtain a compact filter representation and even a fixed-budget model, they still have drawbacks. For example, sparsification algorithms make growth sublinear but cannot constrain the network size (model order) to a predefined range in nonstationary environments. Fixed-budget algorithms [12, 14–16] still require presetting the network size a priori, which is also a major drawback in nonstationary environments. Indeed, in such environments, the complexity of the time series as seen by a filter increases during the transitions because of the mixture of modes and can switch to a very low complexity in the next mode. Online adjustment of the filter order in infinite impulse response (IIR) or finite impulse response (FIR) filters is impractical because all filter coefficients need to be recomputed when the filter order is increased or decreased, which causes undesirable transients. However, this is trivial in KLMS for one major reason: the filter grows at each iteration, while the past parameters remain fixed. Hence, the most serious disadvantage of KLMS, its continuous growth, may become a feature that allows unprecedented exploration of the design space, effectively creating a filter topology optimally tuned to the complexity of the input, both in terms of adaptive parameters and model order. The model order selection can be handled by searching for an appropriate compromise between accuracy and network size [18]. We adopt the minimum description length (MDL) criterion to adaptively decide the model structure of KAFs. The MDL principle, first proposed by Rissanen in [19], is related to the theory of algorithmic complexity [20]. Rissanen formulated model selection as data compression, where the goal is to minimize the length of the combined bit stream of the model description concatenated with the bit stream describing the error.
MDL utilizes the description length as a measure of the model complexity and selects the model with the least description length.
Besides MDL, several other criteria have been proposed to deal with the accuracy/model order compromise. The pioneering work of the Akaike information criterion (AIC) [21] was followed by the Bayesian Information Criterion (BIC, which is also known as the “Schwarz information criterion (SIC)”) [22, 23], the predictive minimum description length (PDL) [24], the normalized maximum likelihood [25], and the Bayesian Ying-Yang (BYY) information criterion [26]. The utilization of the MDL in our work is based on the fact that it is robust against noise and relatively easy to estimate online, which is a requirement of our design proposal. Compared with others, MDL also has the great advantage of relatively small computational costs [27, 28].
The paper structure is as follows: in Section 2, we present a brief review of the MDL principle and the related kernel adaptive filter algorithms, KLMS and quantized KLMS (QKLMS). Section 3 proposes a novel KLMS sparsification algorithm based on an online version of MDL. Because this algorithm utilizes quantization techniques, it is called QKLMS-MDL. The comparative results of the proposed algorithms are shown in Section 4, and final conclusions and discussion are given in Section 5.
2 Foundations
2.1 Minimum description length
The MDL principle addresses a system model as a data codifier and suggests choosing the model which provides the shortest description length. The basic principle of MDL estimates both the cost of specifying the model parameters and the associated model prediction errors [27].
Intuitively, this criterion penalizes large models by taking into consideration the requirement of specifying a large number of weights. The best model according to MDL is the one that minimizes the sum of the model complexity and the number of bits required to encode their errors. MDL has rich connections with other model-selection frameworks. Obviously, minimizing \(-\log P(\boldsymbol {\chi }(n)|\hat {\boldsymbol {\theta }})\) is equivalent to maximizing \( P(\boldsymbol {\chi }(n)|\hat {\boldsymbol {\theta }})\). Therefore, in this sense, MDL coincides with the penalized maximum likelihood (ML) [34] in the parametric estimation problem, as well as AIC. The only difference is the penalty term (AIC uses k, the model order instead of Eq. 3). Furthermore, MDL has close ties to Bayesian principles (MDL approximates BIC asymptotically). Therefore, the MDL paradigm serves as an objective platform from which we can compare Bayesian and non-Bayesian procedures alike, even though the theoretical underpinnings behind them are much different.
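For reference, the two-part code length discussed in this subsection is usually written in the following textbook form. This is a sketch of the standard two-part MDL criterion, not a reproduction of the paper's own numbered equations, which do not appear in this text; the symbol k (number of free model parameters) and the reading of n as the number of observed samples are assumptions made here for illustration.

\[
L\big(\boldsymbol{\chi}(n), \hat{\boldsymbol{\theta}}\big) \;=\; -\log P\big(\boldsymbol{\chi}(n)\,|\,\hat{\boldsymbol{\theta}}\big) \;+\; \frac{k}{2}\log n
\]

Under the same assumptions, the first term is the cost of encoding the data given the fitted model and the second is the cost of encoding the k parameters. AIC keeps the likelihood term and, in natural-log units, replaces the penalty \(\frac{k}{2}\log n\) by \(k\), which is why the two criteria differ only in how strongly they penalize model order.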
The superiority of MDL has been indicated in various applications. In neural networks, MDL was adopted to determine the number of units that mimic the underlying dynamic property of the system [27, 35]. As an improvement, the MDL criterion was later utilized to directly determine an optimal neural network model and successfully applied to prediction problems [36] and control systems [37]. Furthermore, the embedding dimension of an artificial neural network has been decided by constructing a global model with the least description length [38]. Starting from an overly complex model and then pruning unneeded basis functions according to MDL, Leonardis and Bischof [39] proposed a radial basis function (RBF) network formulation to balance accuracy, training time, and network complexity. Besides neural networks, MDL has also been successfully used in vector quantization [40], clustering [31, 41, 42], graphs [43, 44], and so on. In most cases, MDL is used for supervised learning as a penalty term on the error function or as a criterion for model selection [40]. One exception is the work of Zemel [33], who applied MDL to determine a suitable data-encoding schema. Compared with MDL, the AIC tends to select a model that is too complex and is not appropriate for small data sets.
In this work, MDL, as described by Eq. 4, will be utilized in a very specific and unique scenario, which can be best described as a stochastic extension of MDL (MDL-SE) to preserve compatibility with online learning. Recall that we are interested in continuously estimating the best filter order (the dictionary length) online when the statistical properties of the signals change in short windows (locally stationary environments). The KAF provides a simple update of the parameter vector on a sample-by-sample basis; the error description length can be estimated from the short-term sequence of the local error, but we have to estimate appropriately the error sequence probability. Both estimations seem practical. Also recall that a KAF using a stochastic gradient descent with a small step size always reduces locally the a priori error [1]. Therefore, there is an instantaneous feedback mechanism linking the local error and the local changes in the filter weights, which also simplifies the performance testing of the algorithm. In these conditions, we cannot work anymore with statistical operators (expected value) and will use instead temporal average operators on samples in the recent past of the current sample. Moreover, estimating the probability of the error in Eq. 4 must also be interpreted as a local operation in a sliding time window over the time series, with a length relevant to reflect the local statistics of the input. So MDL-SE creates a new parameter, the window length that needs to be selected a priori, and its influence in performance needs to be appropriately quantified and compared with alternative KAF approaches.
2.2 Kernel least-mean-square algorithm
where ε(n) and Ω(n) are, respectively, the prediction error and weight vector at time index n, and η is the step size. α(n)=[η ε(1),η ε(2),…,η ε(n)] is the coefficient vector of KLMS at time n; to simplify the notation in this section, α(i) denotes its ith element.
where κ(·,·) is a Mercer kernel. In this work, the default kernel is the Gaussian kernel, κ(x,x ^{′})= exp(−||x−x ^{′}||^{2}/2σ ^{2}), where σ is the kernel size. The set of samples u(i), used to position the positive-definite function, constitutes the dictionary of centers \(\mathcal {C}(n) = \{\boldsymbol {u}(i)\}_{i=1}^{n}\), which grows linearly with the input samples. KLMS provides a well-posed solution with finite data [1]. Moreover, as an online learning algorithm, KLMS is much simpler to implement, in terms of computational complexity and memory storage, than batch-mode kernel methods because all past coefficients (the scaled errors) remain fixed during the weight update.
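The KLMS recursion described in this subsection can be sketched as follows. This is a minimal illustration, assuming the usual form of KLMS in which every input becomes a center with coefficient η ε(n); the function names `gaussian_kernel` and `klms` and the default parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel between one input x and every row of centers."""
    diff = centers - x
    return np.exp(-np.sum(diff * diff, axis=1) / (2.0 * sigma ** 2))

def klms(inputs, desired, eta=0.5, sigma=1.0):
    """One online pass of KLMS (illustrative sketch, hypothetical names).

    Returns the dictionary of centers, the coefficient vector alpha,
    and the a priori prediction errors.
    """
    inputs = np.asarray(inputs, dtype=float)
    centers = np.empty((0, inputs.shape[1]))
    alpha = np.empty(0)
    errors = np.empty(len(inputs))
    for n, (u, d) in enumerate(zip(inputs, desired)):
        # Prediction uses only the centers stored so far (zero at start).
        y = alpha @ gaussian_kernel(u, centers, sigma) if len(alpha) else 0.0
        errors[n] = d - y
        # Every sample becomes a new center; past coefficients stay fixed.
        centers = np.vstack([centers, u])
        alpha = np.append(alpha, eta * errors[n])
    return centers, alpha, errors
```

Because each sample adds one center, the dictionary returned by this sketch has exactly as many rows as there are inputs, which is the linear growth the text refers to.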
2.3 Quantized kernel least-mean-square algorithm
Quantization techniques have been introduced to KLMS to develop a sparsified kernel filter, called QKLMS. This algorithm quantizes the input space with a simple vector quantization (VQ) algorithm to curb the network size. Every time a new sample arrives, the QKLMS algorithm checks whether its distance to the nearest available center is less than a predefined minimal distance. If an existing center is sufficiently close to the new sample, the center dictionary and network size remain unchanged, but the coefficient of the closest center is updated. Otherwise, the new sample is included in the dictionary. For online kernel learning, most of the existing VQ algorithms are not suitable because the center dictionary is usually trained offline and the computational burden is heavy. Therefore, a very simple and greedy online VQ method is used here based on the Euclidean distance in the input space between the sample and the existing centers.^{2} QKLMS has been shown to be superior to the other methods of curbing the dictionary growth [8]; the major difference is that the information available in the input is never discarded but is used to update the VQ coefficients with each new input sample. Because of nonstationarity, it is unclear how one can cross-validate the minimal distance, so it needs to be selected a priori for each application on a representative data segment; experience shows that performance changes smoothly around the optimum [8]. The sufficient condition for QKLMS mean-square convergence and lower and upper bounds on the theoretical value of the steady-state excess mean square error (EMSE) are studied in [8]. The summary of the QKLMS algorithm with online VQ is presented in Algorithm 1.
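The add-or-merge rule described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's Algorithm 1: the function name `qklms`, the parameter name `gamma` for the minimal distance, and the choice to seed the dictionary with the first sample are assumptions made here.

```python
import numpy as np

def qklms(inputs, desired, eta=0.5, sigma=1.0, gamma=0.3):
    """One online pass of QKLMS with greedy online VQ (illustrative sketch).

    gamma is the minimal distance (quantization size) in the input space.
    """
    inputs = np.asarray(inputs, dtype=float)
    # Assumption: the first sample initializes the dictionary.
    centers = inputs[:1].copy()
    alpha = np.array([eta * float(desired[0])])
    errors = [float(desired[0])]
    for u, d in zip(inputs[1:], desired[1:]):
        dist2 = np.sum((centers - u) ** 2, axis=1)
        y = alpha @ np.exp(-dist2 / (2.0 * sigma ** 2))
        e = float(d - y)
        errors.append(e)
        j = int(np.argmin(dist2))
        if dist2[j] < gamma ** 2:
            # Close to an existing center: keep the dictionary,
            # update only the coefficient of the nearest center.
            alpha[j] += eta * e
        else:
            # Far from all centers: the sample becomes a new center.
            centers = np.vstack([centers, u])
            alpha = np.append(alpha, eta * e)
    return centers, alpha, np.array(errors)
```

With `gamma=0` no sample is ever merged and the sketch reduces to plain KLMS growth; larger values of `gamma` trade accuracy for a smaller dictionary.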
3 MDL-based quantized kernel least-mean-square algorithm
The basic idea of the proposed algorithm, QKLMS-MDL, is as follows: once a new datum is received, the cost of adding this datum as a new center or merging it to its nearest center is calculated. Here, the distance measure for the merge operation is the same as the QKLMS [8]. Then the procedure with smaller description length is adopted: the proposed algorithm compares the strategy of discarding an existing center with the one of keeping it according to the MDL criterion. This process is repeated until all existing centers are scanned, such that the network size can be adjusted adaptively for every sample particularly in nonstationary situations. We show next how to estimate the MDL-SE criterion sufficiently well in these nonstationary conditions and that KAF MDL filters outperform conventional techniques of sparsification or fixed budget presented in the literature.
3.1 Objective function
This corresponds to an implicit Gaussian model for the probability density function (pdf) of the local error; however, we are not attempting to justify theoretically that the Gaussian is the best pdf to fit local errors in this scenario; we are just estimating the log likelihood as required by MDL-SE. Note also that the intrinsic sample by sample feedback of the QKLMS helps here, because if the estimate is not correct, this simply says that the decision of decreasing or increasing the filter order will be wrong, the local error will increase, and the filter will self-correct the order for the next sample. This will create added error power with regard to the optimal performance, and since we are going to compare QKLMS-MDL with the conventional KAF approaches, this will show up as noncompetitive performance.
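To make the role of the Gaussian assumption concrete, the sketch below estimates a windowed description length from recent errors. It is a sketch under stated assumptions, not the paper's objective function: it assumes zero-mean Gaussian errors whose variance is the mean squared error in the window, a model cost of (M/2) log2 of the window length for M centers, and the hypothetical function name `description_length`.

```python
import math
import numpy as np

def description_length(window_errors, n_centers):
    """Windowed two-part description length in bits (illustrative sketch).

    Assumptions: errors in the window are zero-mean Gaussian with variance
    equal to their mean square, and each of the n_centers parameters costs
    0.5 * log2(window length) bits.
    """
    e = np.asarray(window_errors, dtype=float)
    lw = e.size
    # Guard against log(0) when all errors in the window are zero.
    var = max(float(np.mean(e ** 2)), 1e-12)
    # Negative log2-likelihood of the window under the fitted Gaussian.
    error_bits = 0.5 * lw * math.log2(2.0 * math.pi * math.e * var)
    # Cost of describing the model itself.
    model_bits = 0.5 * n_centers * math.log2(lw)
    return error_bits + model_bits
```

Since the error term is based on a density, the value can be negative; only differences between candidate models matter. In this sketch the window length multiplies the error term and enters the model term through a logarithm, so a window of length 1 leaves only the likelihood term.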
3.2 Formulation of MDL-based quantized kernel least-mean-square algorithm
where \(\hat {\epsilon }(i)\) denotes the estimated prediction error after adding a new center and \(\bar {\epsilon }(i)\) is the approximated prediction error after merging. If Δ L _{model_1}(n)>0, the cost of adding a new center is larger than the cost of merging. Therefore, the new datum should be merged into its nearest center according to the MDL criterion. Otherwise, a new center is added to the center dictionary.
Substituting these two equations into Eq. 12, it is straightforward to obtain Δ L _{model_1}(n).
Similar to QKLMS, QKLMS-MDL merges a sample into its nearest center in the input space. The difference is that QKLMS-MDL quantizes the input space according to not only the input data distance but also the prediction error, which results in higher accuracy.
We have just solved the problem of how to increase the network size efficiently. But this is insufficient in a nonstationary condition where older centers should also be discarded. General methods to handle this situation utilize a nonstationarity detector, such as the likelihood ratio test [46], M-estimators [47], or spectral analysis, to check whether the true system or the input data shows different statistical properties. However, the computational complexity of these detectors is high and their accuracy is not good enough. The ability of MDL to estimate system complexity according to the data complexity inspired us to also apply MDL as the criterion for adaptively discarding existing centers. After checking whether a new datum should be added to the center dictionary, the proposed algorithm compares the description length costs of discarding an existing center and of keeping it, and the strategy with the smaller description length is taken. This procedure is repeated until all existing centers are scanned.
Otherwise, ε(i) doesn’t change.
Algorithm 2 gives a summary of the proposed QKLMS-MDL algorithm. Compared with traditional KLMS, the computational complexity of the proposed method is improved. At each iteration, the computational complexity of deciding whether a new center should be added is O(L _{ w }), and that of deciding whether an existing center should be discarded is O(M L _{ w }).
The above derivation assumes that model accuracy and simplicity are equally important, as in the stationary case. In practice, when the system model order is low, accuracy plays a more important role than model simplicity. In fact, in this case, the instantaneous errors may frequently be small by chance, and QKLMS-MDL will wrongly conclude that the model order is too high and needs to be decreased. This is dangerous when the filter order is low, because the filter has few degrees of freedom and the penalty in accuracy will be high. Therefore, we adopt a strategy that privileges system accuracy: if the network size is smaller than a predefined minimal model order threshold N, the corresponding center is kept in the center dictionary regardless of the value of Eq. 15. The disadvantage of this heuristic is that it creates an extra free parameter in the modeling besides the window size L _{ w }. In our simulations, we kept N=5 because the experiments show good results.
The other free parameter, L _{ w }, adjusts the compromise between system accuracy and network size, and it is more important. Note that in Eq. 10 with the estimator of Eq. 11, L _{ w } enters linearly in the error term and as a log in the model complexity term. Therefore, long windows put a higher constraint on the error than on the model order, which indirectly emphasizes model accuracy. When L _{ w }=1, we are just using maximum likelihood. We experimentally verified the effect of L _{ w } in simulations because we still do not have a systematic approach to select this parameter based on the data, but we can advance some experimental observations: (a) the larger the L _{ w }, the more emphasis is given to system accuracy instead of model complexity; conversely, a smaller L _{ w } yields simpler models (fewer parameters). (b) If the model order changes frequently, the value of L _{ w } should be reduced, such that the data utilized to estimate the description length have a higher probability of being in a stationary regime. Finally, even though the procedure of discarding existing centers scales proportionally to the window size only, there is no need to check it at every sample in locally stationary environments because changes do not occur that fast nor all the time. We suggest that the update be tied to the local stationarity of the input data, i.e., performed at a rate at least twice the inverse of L _{ w }.
4 Simulation
In this section, we test QKLMS-MDL in three signal-processing applications. We begin by exploring the behavior of QKLMS-MDL in a simple environment. Next, we move to a real-data time-series prediction problem, and we finish with an application in speech processing. The tests presented are online learning tests in which the weights are learned from an initial zero state and never fixed, so they represent performance on unseen data, similar to test set results, except that the free model parameters have been set a priori at their best values. To gauge the effect of the free parameters of the algorithms, we also present results across a set of their possible values around the optimum. In some conditions, Monte Carlo tests are conducted to illustrate the variability across different conditions.
4.1 System model
and the corresponding desired signal is y(n). In this section, both the kernel size and the step size are set to 1.0. The online MSE is calculated as the mean of the squared prediction error over a running window of 100 samples.
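The running-window MSE used as the performance measure can be computed as sketched below. This is a minimal illustration, assuming a trailing window and a shorter window for the first samples; the function name `running_mse` is hypothetical.

```python
import numpy as np

def running_mse(errors, window=100):
    """Mean squared error over a trailing window (illustrative sketch).

    For the first window-1 samples the average is taken over the
    samples available so far (an assumption made here).
    """
    sq = np.asarray(errors, dtype=float) ** 2
    # Prefix sums let each window sum be read off in O(1).
    csum = np.concatenate(([0.0], np.cumsum(sq)))
    idx = np.arange(1, sq.size + 1)
    start = np.maximum(idx - window, 0)
    return (csum[idx] - csum[start]) / (idx - start)
```

For example, `running_mse([1, 2, 3], window=2)` averages the squared errors 1, then 1 and 4, then 4 and 9.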
The final network size comparison of QKLMS and QKLMS-MDL in a system identification problem
Algorithm | Abrupt change | \(\frac {1}{500}\) | \(\frac {1}{5000}\) |
---|---|---|---|
QKLMS | 118.4±4.22 | 95.08±3.47 | 124.37±3.52 |
QKLMS-MDL | 12.77±7.55 | 14.89±9.19 | 14.23±10.85 |
Parameter settings in a channel equalization problem to achieve almost the same final MSE in each condition
Algorithm | Abrupt change | \(\frac {1}{500}\) | \(\frac {1}{5000}\) |
---|---|---|---|
QKLMS | γ=0.65 | γ=0.7 | γ=0.65 |
QKLMS-MDL | L _{ w }=100 | L _{ w }=100 | L _{ w }=100 |
4.2 Santa Fe time-series prediction
Parameter settings for different algorithms in a time-series prediction
Algorithm | Same network size | Same final MSE |
---|---|---|
QKLMS | γ=1.97 | γ=1.54 |
QKLMS-MDL | L _{ w }=150 | L _{ w }=150 |
KLMS-NC | σ _{1}=1.38, σ _{2}=0.001 | σ _{1}=0.85, σ _{2}=0.001 |
KLMS-SC | λ=0.005, T _{1}=90, T _{2}=−0.085 | λ=0.005, T _{1}=300, T _{2}=−1.6 |
4.3 Speech signal analysis
Prediction analysis is a widely used technique for the analysis, modeling, and coding of speech signals [1]. A slowly time-varying filter can be used to model the vocal tract, while a white noise sequence (for unvoiced sounds) represents the glottal excitation. Here, we use the kernel adaptive filter to establish a nonlinear prediction model for the speech signal. Ideally, for each different phoneme, which represents a window of the time series with the same basic statistics, the prediction model of speech should adjust accordingly to “track” such change. We therefore conjecture that phonetic changes can be identified by observing the behavior of the kernel adaptive filter. For example, when the filter order has local peaks, the phonetics should be changing because the filter observes two different stationary regimes, as demonstrated in Section 4.2. Such abrupt changes in the filter can thus be used for phoneme segmentation, which is rather difficult to do even for a human observer.
A short, fully voiced sentence from a male speaker is used as the source signal. The original voice file can be downloaded from [50]. To increase the MDL model accuracy when the stationary window is short, this signal is upsampled to 16,000 Hz from the original 10,000 Hz. QKLMS is then applied with the previous 11 samples, u(i)=[x(i−11),…,x(i−1)]^{ T }, as the input vector. In this section, the Gaussian kernel with σ=0.3 is selected. The step size is 0.7 according to the cross-validation test. The quantization factor of QKLMS is set to 0.67 and the window length of QKLMS-MDL is 50, such that QKLMS and QKLMS-MDL have similar prediction MSE performance.
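The input vectors u(i)=[x(i−11),…,x(i−1)]^{ T } can be built from the speech samples with a sliding window, as sketched below. The function name `delay_embed` is hypothetical, and pairing each input vector with x(i) as the prediction target is an assumption consistent with one-step prediction.

```python
import numpy as np

def delay_embed(x, order=11):
    """Build delay vectors and one-step targets (illustrative sketch).

    Row j of the first output holds the `order` samples that precede
    the target in position j of the second output.
    """
    x = np.asarray(x, dtype=float)
    # Each window has order + 1 samples: `order` inputs plus one target.
    windows = np.lib.stride_tricks.sliding_window_view(x, order + 1)
    inputs = windows[:, :order].copy()
    targets = windows[:, order].copy()
    return inputs, targets
```

A signal of length T yields T − 11 input/target pairs with the default order, since the first 11 samples have no complete history.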
5 Conclusions
This paper proposes, for the first time, a truly self-organizing adaptive filter in which both the filter weights and the model order are adapted online from the input data. How to choose an efficient kernel model that trades off computational complexity against system accuracy in nonstationary environments is the ultimate test for an online adaptive algorithm. Based on the popular and powerful MDL principle, a sparsification algorithm for kernel adaptive filters is proposed. Experiments show that QKLMS-MDL successfully adjusts the network size according to the input data complexity while keeping the accuracy in an acceptable range. This property is very useful in nonstationary conditions, where other existing sparsification methods keep increasing the network size. Few free parameters and an online model make QKLMS-MDL practical in real-time applications.
We believe that many applications will benefit from QKLMS-MDL. For example, speech signal processing is an extremely interesting area. Nonstationary behavior is the nature of speech. Furthermore, owing to a sufficient number of samples to quantify short-term stationarity, front-end signal processing (acoustic level) based on QKLMS-MDL seems to be a good methodology to improve the quantification of speech recognition because of its fast response. However, this is still a very preliminary paper, and many more results and better arguments for the solutions are required. Although self-organization solves some problems, it also brings new ones in the processing. In fact, as described in Section 3, the QKLMS-MDL algorithm “forgets” in nonstationary conditions the previous learning results and needs to relearn the input-output mapping when the system switches to a new state. First, the filter weights would have to be read and clustered to identify the sequence of phonemes. Second, the MDL strategy keeps the system model at the simplest structure, but when a previous state appears again, QKLMS-MDL must relearn from scratch because the former centers are removed from the center dictionary. This relearning should be avoided, and increasingly accurate modeling techniques should be developed. Future work will have to address this aspect of the technique.
6 Endnotes
^{1} The unit of description length could also be the nat, which is calculated with ln rather than log2 as we do here. For simplicity, we use log to represent log2 in the remainder of this paper.
^{2} A translation invariant kernel establishes a continuous mapping from the input space to the feature space. As such, the distance in feature space is monotonically increasing with the distance in the original input space. Therefore, for a translation invariant kernel, the VQ in the original input space also induces a VQ in the feature space.
Declarations
Acknowledgments
This work was supported by NSF Grant ECCS0856441, NSF of China Grant 61372152, and the 973 Program 2015CB351703 in China.
Authors’ contributions
SZ conceived of the idea and drafted the manuscript. JP, BC, and PZ have contributed in the supervision and guidance of the manuscript. CZ participated in the simulation and experiment. All authors did the research and data collection. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- W Liu, PP Pokharel, JC Príncipe, The kernel least mean square algorithm. IEEE Trans. Sig. Process. 56(2), 543–554 (2008).
- Y Engel, S Mannor, R Meir, The kernel recursive least-squares algorithm. IEEE Trans. Sig. Process. 52(8), 2275–2285 (2004).
- J Platt, A resource-allocating network for function interpolation. Neural Comput. 3(4), 213–225 (1991).
- P Bouboulis, S Theodoridis, Extension of Wirtinger’s calculus to reproducing kernel Hilbert spaces and the complex kernel LMS. IEEE Trans. Sig. Process. 59(3), 964–978 (2011).
- K Slavakis, S Theodoridis, Sliding window generalized kernel affine projection algorithm using projection mappings. EURASIP J. Adv. Sig. Process. 1:, 1–16 (2008).
- C Richard, JCM Bermudez, P Honeine, Online prediction of time series data with kernels. IEEE Trans. Sig. Process. 57(3), 1058–1066 (2009).
- W Liu, Il Park, JC Príncipe, An information theoretic approach of designing sparse kernel adaptive filters. IEEE Trans. Neural Netw. 20(12), 1950–1961 (2009).
- B Chen, S Zhao, P Zhu, JC Príncipe, Quantized kernel least mean square algorithm. IEEE Trans. Neural Netw. Learn. Syst. 23(1), 22–32 (2012).
- B Chen, S Zhao, P Zhu, JC Príncipe, Quantized kernel recursive least squares algorithm. IEEE Trans. Neural Netw. Learn. Syst. 24(9), 1484–1491 (2013).
- SV Vaerenbergh, J Vía, I Santamaría, A sliding-window kernel RLS algorithm and its application to nonlinear channel identification. IEEE Int. Conf. Acoust. Speech Sig. Process, 789–792 (2006).
- SV Vaerenbergh, J Vía, I Santamaría, Nonlinear system identification using a new sliding-window kernel RLS algorithm. J. Commun. 2(3), 1–8 (2007).
- SV Vaerenbergh, I Santamaría, W Liu, JC Príncipe, Fixed-budget kernel recursive least-squares. IEEE Int. Conf. Acoust. Speech Sig. Process, 1882–1885 (2010).
- M Lázaro-Gredilla, SV Vaerenbergh, I Santamaría, A Bayesian approach to tracking with kernel recursive least-squares. IEEE Int. Work. Mach. Learn. Sig. Process. (MLSP), 1–6 (2011).
- S Zhao, B Chen, P Zhu, JC Príncipe, Fixed budget quantized kernel least-mean-square algorithm. Sig. Process. 93(9), 2759–2770 (2013).
- D Rzepka, in 2012 IEEE 17th Conference on Emerging Technologies & Factory Automation (ETFA). Fixed-budget kernel least mean squares (Krakow, 2012), pp. 1–4.
- K Nishikawa, Y Ogawa, F Albu, in Signal and Information Processing Association Annual Summit and Conference (APSIPA). Fixed order implementation of kernel RLS-DCD adaptive filters (Asia-Pacific, 2013), pp. 1–6.
- K Slavakis, P Bouboulis, S Theodoridis, Online learning in reproducing kernel Hilbert spaces. Sig. Process. Theory Mach. Learn. 1:, 883–987 (2013).
- W Gao, J Chen, C Richard, J Huang, Online dictionary learning for kernel LMS. IEEE Trans. Sig. Process. 62(11), 2765–2777 (2014).
- J Rissanen, Modeling by shortest data description. Automatica. 14(5), 465–471 (1978).
- M Li, PMB Vitányi, An introduction to Kolmogorov complexity and its applications (Springer-Verlag, New York, 2008).
- H Akaike, A new look at the statistical model identification. IEEE Trans. Autom. Control. 19(2), 716–723 (1974).
- G Schwarz, Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978).
- A Barron, J Rissanen, B Yu, The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory. 44(6), 2743–2760 (1998).
- J Rissanen, Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory. 30(4), 629–636 (1984).
- J Rissanen, MDL denoising. IEEE Trans. Inf. Theory. 46(7), 2537–2543 (2000).
- L Xu, Bayesian Ying Yang learning (II): a new mechanism for model selection and regularization. Intell. Technol. Inf. Anal, 661–706 (2004).
- Z Yi, M Small, Minimum description length criterion for modeling of chaotic attractors with multilayer perceptron networks. IEEE Trans. Circ. Syst. I: Regular Pap. 53(3), 722–732 (2006).
- T Nakamura, K Judd, AI Mees, M Small, A comparative study of information criteria for model selection. Int. J. Bifurcation Chaos Appl. Sci. Eng. 16(8), 2153 (2006).
- T Cover, J Thomas, Elements of information theory (Wiley, 1991).
- M Hansen, B Yu, Minimum description length model selection criteria for generalized linear models. Stat. Sci. A Festschrift for Terry Speed. 40:, 145–163 (2003).
- K Shinoda, T Watanabe, MDL-based context-dependent subword modeling for speech recognition. Acoust. Sci. Technol. 21(2), 79–86 (2000).
- AA Ramos, The minimum description length principle and model selection in spectropolarimetry. Astrophys. J. 646(2), 1445–1451 (2006).
- RS Zemel, A minimum description length framework for unsupervised learning, Dissertation (University of Toronto, 1993).
- AWF Edwards, Likelihood (Cambridge University Press, 1984).
- M Small, CK Tse, Minimum description length neural networks for time series prediction. Phys. Rev. E. 66(6), 066701 (2002).
- A Ning, H Lau, Y Zhao, TT Wong, Fulfillment of retailer demand by using the MDL-optimal neural network prediction and decision policy. IEEE Trans. Ind. Inform. 5(4), 495–506 (2009).
- JS Wang, YL Hsu, An MDL-based Hammerstein recurrent neural network for control applications. Neurocomputing. 74(1), 315–327 (2010).
- YI Molkov, DN Mukhin, EM Loskutov, AM Feigin, GA Fidelin, Using the minimum description length principle for global reconstruction of dynamic systems from noisy time series. Phys. Rev. E. 80(4), 046207 (2009).
- A Leonardis, H Bischof, An efficient MDL-based construction of RBF networks. Neural Netw. 11(5), 963–973 (1998).
- H Bischof, A Leonardis, A Selb, MDL principle for robust vector quantisation. Pattern Anal. Appl. 2(1), 59–72 (1999).
- T Rakthanmanon, EJ Keogh, S Lonardi, S Evans, MDL-based time series clustering. Knowl. Inf. Syst., 1–29 (2012).
- H Bischof, A Leonardis, in 15th International Conference on Pattern Recognition. Fuzzy c-means in an MDL-framework (Barcelona, 2000), pp. 740–743.
- I Jonyer, LB Holder, DJ Cook, MDL-based context-free graph grammar induction and applications. Int. J. Artif. Intell. Tools. 13(1), 65–80 (2004).
- S Papadimitriou, J Sun, C Faloutsos, P Yu, Hierarchical, parameter-free community discovery. Mach. Learn. Knowl. Discov. Databases., 170–187 (2008).
- E Parzen, On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076 (1962).
- AM Mood, FA Graybill, DC Boes, Introduction to the Theory of Statistics (McGraw-Hill, USA, 1974).
- SA Geer, Applications of empirical process theory (The Press Syndicate of the University of Cambridge, Cambridge, 2000).
- The Santa Fe time series competition data. http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html. Accessed June 2016.
- AS Weigend, NA Gershenfeld, Time series prediction: forecasting the future and understanding the past (Westview Press, 1994).
- Sound files obtained from system simulations. http://www.cnel.ufl.edu/~pravin/Page_7.htm. Accessed June 2016.
- B Bigi, in The eighth international conference on Language Resources and Evaluation. SPPAS: a tool for the phonetic segmentations of Speech (Istanbul, 2012), pp. 1748–1755.
- SPPAS: automatic annotation of speech. http://www.sppas.org. Accessed June 2016.