Clustering Time Series Gene Expression Data Based on Sum-of-Exponentials Fitting

This paper presents a method based on ﬁtting a sum-of-exponentials model to the nonuniformly sampled data, for clustering the time series of gene expression data. The structure of the model is estimated by using the minimum description length (MDL) principle for nonlinear regression, in a new form, incorporating a normalized maximum-likelihood (NML) model for a subset of the parameters. The performance of the structure estimation method is studied using simulated data, and the superiority of the new selection criterion over earlier criteria is demonstrated. The accuracy of the nonlinear estimates of the model parameters is analyzed with respect to the Cram´er-Rao lower bounds. Clustering examples of gene expression data sets from a developmental biology application are presented, revealing gene grouping into clusters according to functional classes.


INTRODUCTION
The gene expression time profiles are a rich source of information about the dynamics of the underlying genomic network.The experiments are often taken at nonuniform time points, suggested by the biologist's intuition about the time scale of the important changes in the analyzed biological process, for example, a developmental process or administration of a drug.Clustering the time profiles of the thousands of genes recorded by the microarrays is a very important exploratory problem, for which several methods have been proposed in the past [1,2,3].
Most of the existing methods, no matter whatever heuristically motivated, or model-based methods [4] do not make use of the time values at which the measurements have been taken, loosing potentially useful information regarding the analyzed waveforms.Some approaches that take into account the temporal structure in gene expression data are based on hidden Markov model [5], spline approximation [6], or on analysis of temporal variation [7].In [8], an autoregressive model is used for the gene expression time series, and the clustering is performed with a Bayesian criterion which measures the similarity between two time series.A comprehensive study on various clustering methods applied to gene expression data that are time series can be found in [9].
A general methodology for modelling the time series collected at nonuniform time points has been presented in [10], where the sum-of-exponentials model was used for getting estimates of the gains and time constants, and then a generalized correlation coefficient was introduced based on the cost of describing all relevant parameters of the waveforms interpolated at an equidistant grid.The generalized correlation coefficient was intended for various applications, including gene prediction in genetic networks and disease classification.The sum-of-exponentials model is appealing since it can be interpreted as the transient output of a linear dynamical system evolving from a first stationary regime to another one.
The work in [10] was first extended in a preliminary version of this paper [11], by elaborating on the first critical stage, that of fitting the sum-of-exponentials model.We introduce a new minimum description length (MDL) criterion for selecting the number of exponentials in the model, and provide a statistical analysis of the fitting accuracy.A fitting procedure combining several known methods is also proposed and is found experimentally to have an accuracy close to Cramér-Rao lower bound.
We apply the proposed methods for finding the dynamical parameters in the models of the time series from two experimental data sets containing gene expressions measured during the development of mouse cerebellum and dentate gyrus.The estimated parameters are subsequently used in a clustering algorithm.
The remainder of the paper is organized as follows.In the next section, we outline the algorithm for fitting a sum of exponentials to nonuniformly sampled data.The expression of Cramér-Rao lower bound is obtained, and the new MDL criterion is introduced for estimating the structure parameter.Based on sum-of-exponentials model, a new clustering procedure is proposed in Section 3, then is tested with simulated data.An experiment with data from developmental biology is conducted in Section 4, and the enrichment of functional categories in clusters found by the proposed method is investigated with statistical tests.

Motivation
A linear differential equations model for the concentrations of mRNA and proteins was introduced in [12].In [13], since usually only the concentrations of mRNA are measured, the differential equations model was modified to the form where y(t) contains only mRNA concentrations as a function of time, and the constant matrix M describes interactions between various genes.Biological considerations lead to constraining the eigenvalues and the eigenvectors of matrix M to be real-valued.Supplementary, the eigenvalues of M are assumed to be negative to ensure that exp(Mt) → 0 as t → ∞.It is well known that the solution for (1) is given by y(t) = exp(Mt)y(0), where y(0) is the vector of measurements at time moment zero.The solution can be expressed as a sum-of-exponential terms multiplied by polynomial functions of t, which is the basic reason for our choosing of the sum of exponentials as a dynamic model for gene expressions.In [12] it was argued that only the case when all these polynomials are constant (have degree zero) has biological relevance.More on linear differential equations models for gene expression data can be found in [14,15].

Problem formulation
We consider an estimation procedure for the following nonlinear regression model: where the model parameters are θ γ ={α j , β j | j = 1, 2, . . ., p}, and the structure parameter is γ = 2p.The α j 's are realvalued and β j 's are taken, without loss of generality, to verify and introduce the residual sum of squares as the sum in (4) evaluated at the LS parameters If time moments are equally spaced, the estimation problem can be rephrased as a simple linear least-squares problem, but even then important difficulties arise when fitting a sum of exponentials: choosing initial values and illconditioning when two or more β j 's are close [16].Since the problem is complex, many algorithms have been proposed to solve it, beginning with the Prony's method introduced as early as 1795.The method was originally used for fitting an exponential model to uniformly sampled experimental data, and consists in solving a set of linear equations for the recurrence equation that the signals satisfy.It was shown in [17] that Prony's method is close to Pisarenko's method, which was analyzed and further improved in signal processing community.Many modified Prony algorithms have been also proposed, see, for example, [16,18].A survey on various algorithms for fitting a sum of exponentials can be found in [19], where a special section is dedicated to minimizing the LS criterion by using standard optimization techniques.One such technique is the Al-Baali-Fletcher algorithm [20], which is a hybrid method in the sense that during the iterations the algorithm switches between GN (Gauss-Newton) and BFGS (Broyden-Fletcher-Goldfarb-Shanno) for the estimation of the Hessian matrix.
When fitting a sum of exponentials by minimizing an LS criterion, a critical part is the choice of initial values for the α j and β j parameters.An algorithm for finding initial values in the particular case when all α j coefficients are strictly positive is given in [21].In the general case when α j 's are not constrained, the grid search is generally applied [22].

An estimation procedure for the nonlinear regression model
Fitting an exponential model to gene expressions is hard since the number of available measurements is small and they are nonuniformly sampled.We resort to a grid search for initializing the parameters.For simplicity of notation, we define two vectors of parameters, namely,  b) is used to initialize the Al-Baali-Fletcher optimization algorithm [20].We employ the Matlab implementation of this algorithm as provided by the Tomlab software, which is publicly available at http://www.mdh.se/ima/personal/khm01/tom/.

Cramér-Rao lower bound (CRLB)
For investigating some statistical aspects of the estimation problem, we resort to the computation of the Cramér-Rao lower bound (CRLB).We denote by F −1 (θ γ ) the inverse of the Fisher information matrix for the signal model when the parameters are θ γ .Let θj be an unbiased estimator for the j'th component of θ γ .A classical result from statistics [23] states that the CRLB for the variance var( θj ) is given by [F −1 (θ γ )] j j , where the index j j designates the entry of the matrix for which both the row and the column are equal to j.The independence assumption is rather strong for gene expression, and further studies will be needed in order to estimate a correlated model for the noise, especially when the number of data points available will increase.Under the hypothesis of white Gaussian noise, we find in Appendix B closed-form expressions for the entries of the Fisher information matrix.First we obtain these expressions for the set of parameters where δ j = exp(−β j ), 1 ≤ j ≤ p, denote the decay rates.Since time constants are more important for the interpretation of results obtained with gene expression data, we further develop the calculus for the Fisher information matrix when the set of parameters is given by In the definition above, we use τ j for the time constants, namely, τ j = T 0 /β j , 1 ≤ j ≤ p, where T 0 is the greatest common divisor of {t 2 − t 1 , t 3 − t 2 , . . ., t n − t n−1 }, and t 1 , t 2 , . . ., t n are assumed to be integer-valued.It is obvious how to obtain the conversion between the time constants and the decay terms: τ j = −T 0 / ln δ j and δ j = exp(−T 0 /τ j ).Under the hypothesis of white Gaussian noise with zero mean and variance σ 2 , the expression of the log-likelihood function is given by It is immediate to observe that the ML estimator for the nonlinear regression model is the one given by (4).In general, the ML estimator is optimal since asymptotically it is unbiased and achieves the CRLB [23], which is a highly desired property.We are interested to assess the estimation results for finite samples, and especially when the number of measurements is small.Even in these cases, it is customary to compare the variance of estimates with CRLB, but two important facts have to be considered when interpreting the results [24]: (a) there exist biased estimators for which the variance is even smaller than CRLB, as shown in the examples discussed in [25,26]; (b) the ML estimator achieves asymptotically the CRLB, but an estimator which achieves the lower bound may not exist for small samples.

Structure parameter estimation
In the discussion above, we have assumed that the structure parameter γ = 2p is known, or equivalently the number of exponential terms in (2) is given.This is not the case in practical applications, thus we need to estimate the value of γ, which amounts to select this value from a finite set of positive even integers.The selection is usually performed based on well-known criteria as MDL or AIC [19].When applying the form of MDL principle called two-stage description length [27], the structure parameter is given by γ * = arg min 2≤γ≤γmax MDL(γ), where We use the notation log(•) to denote the logarithm base two.
The MDL criterion represents the ideal code length for transmitting the values of measurements z(t 1 ), z(t 2 ), . . ., z(t n ) from a hypothesized encoder to a decoder.For a fixed structure γ, the parameters θ γ are estimated as described above, and each parameter is encoded by using (1/2) log n bits, which leads to a total cost that equals the second term in (9).The first term represents the number of bits necessary for encoding z(t 1 ), z(t 2 ), . . ., z(t n ) given the estimated values θγ ; RSS(γ) is calculated as in (5).
We propose to apply a different coding scenario that allows the use of recent advances in universal modeling, namely, the normalized maximum-likelihood (NML) estimator.The key observation is that once the estimated values for β's are known both at the encoder and at the decoder sites, the modeling problem reduces to a linear regression model as shown in Appendix A. Therefore, it is straightforward to use for the ideal code length the NML criterion introduced in [28].It remains only to find a method for transmitting the estimated values of β j 's from the encoder to the decoder.A natural solution is to encode every β j parameter by using the asymptotically optimal number of bits, namely, (1/2) log n bits.We obtain now the nMDL(γ) criterion as a sum of two terms: the first one is given by NML formula from [29], and the second one is (γ/4) log n, the cost for transmitting the β j 's.Therefore, we obtain where â = [ α1 α2 The performances of MDL and nMDL criteria are compared in Section 3.2 for simulated data.

New clustering algorithm
Assume that, applying the procedure described above, we have fitted a sum of exponentials to the time profile measured for a certain gene.Finding similarities between the expressions of this gene and another gene expressions reduces to a comparison between the two sets of the estimated parameters.At this point, a large family of comparison criteria can be considered.For example, we can first compare the estimated structure parameters, and if both model orders are the same we can further compare the gains and the time constants, respectively.Since generally a microarray data set contains measurements for thousands of genes, it is prohibitive to consider all possible pairs of genes for finding similarities.
We decide that two different genes share common regulation if the set of time constants is the same for both of them, and we cluster the genes together.Observe that the proposed similarity measure for genes ignores the gains.We do not know the true values of the time constants, and the estimated values are all different with probability one.We model the time constants estimated for all genes from a microarray data set as outcomes of a Gaussian mixture model, and cluster them in N TC clusters with classification-expectationmaximization (CEM) algorithm [30].The centroids found by CEM are denoted by T i where index i takes values between 1 and N TC .The centroids are increasingly ordered, namely, T i < T j when 1 ≤ i < j ≤ N TC .For clustering the time constants, we have pooled them together no matter the model order inferred for every gene.
We consider that, for a particular gene, the model order given by an information theoretic criterion like MDL or nMDL is p * , and the estimated time constants are τ1 , τ2 , . . ., τp * .We associate to this gene the sequence determined by the centroids of the clusters to which the CEM algorithm assigns the time constants τ1 , τ2 , . . ., τp * .If two or more time constants of the considered gene are assigned to the same cluster, then the corresponding centroid occurs only once in the sequence of centroids, and consequently 1 ≤ π ≤ p * .We cluster together two genes if the same sequence of centroids Therefore, we first cluster the time constants, and then we use the result to further cluster the genes.It is interesting to investigate the relationship between N TC , the number of clusters for time constants, and N GC , the number of gene clusters.Under the hypothesis that the information theoretic criterion selects the number of exponentials from the set {1, 2, . . ., p max }, it is easy to prove that 1 ≤ N GC ≤ pmax i=1 NTC i , where NTC i = 0 for i > N TC .For the case p max ≥ N TC , the inequality becomes 1 ≤ N GC < 2 NTC .
To illustrate the situation when p max < N TC , we choose p max = 3 and N TC = 5.For this selection, the number of gene clusters can potentially be as large as 25.
For completeness, we list in Algorithm 1 the newly introduced gene clustering algorithm.

Experimental results with simulated data
For validating the proposed method, we test it with carefully crafted data.Note that fitting sum of exponentials to the measured data is the crucial step of the procedure, in the sense that unreliable estimates for time constants can lead to false conclusions on the similarity of the genes.This is the reason for which we generate data according to a prototype that was introduced in [31] and used since then as a benchmark to evaluate the performances of various estimation algorithms.The model used in [31] to generate data is the same as the one in (2) with the following parameters: p = 3, α 1 = 0.6, α 2 = 0.3, α 3 = 0.1, and β 1 = 0.1, β 2 = 0.01, β 3 = 0.001.Their proposed task was to estimate the parameters from 20 measurements nonuniformly sampled at time points between 0 and 6000, where no noise was added, but every measurement rounded to four significant digits.It is obvious that their goal is the same as in the estimation problem treated in Section 2, but the solution proposed by [31] applies only when all α j 's are strictly positive.
We extend this example by considering more linear combinations of the same exponential terms.To fix the ideas, in this section, we denote by G i , 1 ≤ i ≤ 10, a gene prototype, and not simply a gene.In Table 1 ten different gene prototypes are shown by indicating for each of them the values of p and α j 's.Note that for all prototypes the β j 's are the same as in the example from [31].Beginning from a particular prototype G i , we generate measurements for a gene by adding i.i.d.noise to the waveform given by G i at the time points 0, 1, 2, 3, 4, 5, 10, 30, 60, 150, 300, 400.We employ Gaussian noise with zero mean and variance σ 2 .For G 1 , we consider only the first 9 time points from the set of 12 time points listed above.The reason is that G 1 takes values very close to zero when time t ≥ 150.Therefore, for genes generated according to prototype G 1 , only 9 nonequidistant measurements are used in estimation, and for prototypes G 2 , . . ., G 10 the estimation of parameters is based on 12 nonequidistant measurements.
To test the capabilities of the method discussed in Section 2, we estimate the parameters of the model (3) from gene measurements simulated according to the previous scenario.For every prototype in Table 1, we generate measurements for 50 different genes, or equivalently, we consider 50 different noise realizations.The "true" time constants span a very large domain from 1 to 1000.This makes difficult the task of defining the domain in the grid-search algorithm.The reported estimation results are obtained when, for every time constant, the search domain is limited to [1,1200], and the search step is 20.The computational burden is decreased by assuming a priori that the set of estimated time constants for every gene is ordered.For improving the numerical conditioning of the algorithm, we force the difference between every two time constants to be larger than 20 3. For each gene that is not labeled with zero, replace every time constant τi , 1 ≤ i ≤ p * , with the centroid of the cluster to which τi was assigned.4. Group together into the same cluster all genes whose time constants are assigned to the same set of centroids.Output: each gene is labeled according to which cluster it belongs.For the genes with label zero, the sum-of-exponentials model does not fit well, therefore they are not included in any cluster.
Table 1: The parameters for the ten gene prototypes used in experiments with simulated data.For all gene prototypes, the model is given in (2), and the smaller the search step, the better the accuracy of initialization points for the optimization algorithm, but a small value for the step search means also a significant computational burden.In the analyzed case, a value of 20 is a good tradeoff between accuracy and complexity.We run the grid-search algorithm and then the optimization algorithm, and report in Table 2 the results obtained when considering 50 trials for every gene prototype.The noise standard deviation is 10 −3 .To have a better image on signal-to-noise ratio, remark that the values of each gene prototype varies between one for time moment zero, and asymptotic value zero.For the results in Table 2, the bias is small and the variance is close to CRLB.These estimations for coefficients α j and time constants τ j = 1/β j are obtained by assuming that the true value of p is known.Using the same simulated data sets, we compare in Table 3 the estimations of the structural parameter obtained by MDL and nMDL criteria.Observe for σ = 10 −3 that nMDL estimates are 100% correct for seven out of ten prototypes, and the proportion of correct estimates is never smaller than 82%.At this level of noise, nMDL does not perform worse than MDL criterion in any of the cases.When the level of noise is increasing, the proportion of correct estimations declines for both MDL and nMDL, but overall we can conclude that nMDL is superior.
To complete the experiments, we have to cluster the estimated time constants, and for this task we use the Matlab programs which are available at http://www.cs.ucl.ac.uk/ staff/D.Corney/ClusteringMatlab.html and http://www.ncrg.aston.ac.uk/netlab/.We try to mimic the real situations when more copies are available for the same microarray.We assume that the microarray contains measurements for 20 genes and 25 copies of it are available.More precisely, we randomly distribute the genes from the data sets already employed in the previous experiments such that to have 25 different copies of the same microarray, and every copy to contain exactly two genes from each prototype.
In the experiments, we apply two different procedures: (a) cluster the time constants estimated for the genes that belong to a microarray copy, and ignore the estimations obtained for the other microarray copies ("one clustering for each copy"); (b) cluster all time constants, estimated for all microarray copies ("one clustering for all copies").For both procedures, we assume that the number of clusters is 3, and report the results in Tables 4, 5, 6, and 7. We mention that in our implementation, all estimated time constants that have values smaller than 1, or larger than 1200 are considered outliers.For clustering a set of time constants, we first eliminate the outliers, and then run the CEM algorithm starting from 40 randomly chosen initialization points.Among the 40 resulting solutions, we select the partition that minimizes the sum of squared errors.
The results in Tables 4 and 5 show how well the estimated time constants have been allocated to clusters.Convention is that numbers represented with shades are counts of how many times the time constants are properly assigned to clusters.In our settings, for an ideal method, all counts represented with shades in Tables 4 and 5 are equal to 50, and all other counts in these tables take value zero.Now we can easily observe in Table 4 that the results yielded by the proposed method when clustering separately every copy, for σ = 10 −3 , are very close to the best possible results.One single misclassification occurs for a gene generated according to prototype G 8 .We investigate closely the estimations for G 8 when σ = 10 −3 : percentages in Table 3 indicate that nMDL estimates correctly the number of clusters for 92% of genes, and overestimates the order for the rest of 8% genes.As we have generated 50 different realizations for G 8 , it means that the model order was estimated to be three for four genes.Based on this observation, we could expect that the row corresponding to G 8 in Table 4 contains value 50 (represented with shades) for the first two counters, and value 4 for the third counter.The value of the last counter is 1, which can be explained as follows: in the case of three G 8 genes for which the order was overestimated, the largest time constant was grouped by CEM together with the second largest time constant.The discussion on this particular example gives hints for the interpretation of the data in Tables 4 and 5, and emphasizes the importance of using an accurate estimator for the structure parameter.When comparing the content of the two tables, we note again the superiority of the nMDL criterion.We observe also that performing "one clustering for all copies" does not improve the grouping of genes.The data have been generated such that the "true" time constants for every gene, in all microarray copies, belong to the set {10, 100, 1000}.We compare next the values 10, 100, 1000, clustering 6 are shown the centroids obtained when applying the scenario "one clustering for all copies".Since "one clustering for each copy" leads to 25 different estimations for every centroid, we report in Tables 6 and 7 the computed mean and standard deviation.Remark in the case when the structure parameter is estimated with nMDL, and noise standard deviation is σ = 10 −3 , that the centroids are close to the "true" values.When σ = 10 −1 , the centroids corresponding to 10 and 100 take values larger than expected.
The clustering results in Tables 4, 5, 6, and 7 are a good measure of the accuracy for the proposed method.Encouraged by these results, we apply next the clustering algorithm for data sets from developmental biology.

Data sets from developmental biology
We apply the newly introduced clustering method for measurements obtained in experimental studies from develop-  [1] is focused on studying the postnatal development of mouse cerebellar cortex, and the second one [3] analyzes the postnatal development of the dentate gyrus of mouse hippocampus.Since the cerebellar cortex and the dentate gyrus have common features, in [1], comparisons are performed between the time profiles obtained in both experiments.The measurements are sampled at six time points, namely, 2, 4, 8, 12, 21, and 42 days after birth.In [3], the comparison of gene expressions relies in Euclidean distance.A statistical analysis is also conducted: first the genes are grouped by using Ward's hierarchical clustering method [32], and then the enrichment of functional categories in each cluster is investigated.As functional class labels have been associated with most of the genes, there is a significant interest on finding a correspondence between time profile of gene expressions, and the functional role played by each gene.Once the clustering is performed, it remains to decide if a particular functional category F appears unusually often within a particular cluster C.

Fitting the sum-of-exponentials model
First we apply the algorithm described in Section 2 for fitting sum of exponentials to the 1412 gene profiles measured during postnatal development of mouse cerebellum [1,3].The optimal number of exponentials in each sum is selected from the set p ∈ {1, 2, 3} by applying the nMDL criterion.For every time constant, the grid-search domain is taken as [1,35], and the search step is 0.3.

Goodness-of-fit testing
Our first aim is to test whether the developmental biology data fit well the sum-of-exponentials model.Among the rich family of testing methods, we choose a criterion that is intuitive and very simple to implement.For each gene, we decide that the estimation is reliable only when the "signal-to-noise ratio" is high enough, or equivalently, when where the notations are like in is chosen in the interval (0, 0.5) based on a procedure described in the sequel.To investigate the robustness of this criterion, we compare it with another one that exploits in a different way the information in the errors observed when fitting the sum-of-exponentials model to gene expression data.To illustrate this second criterion we use the data set measured during postnatal development of mouse cerebellum.For an arbitrary gene in the analyzed data set, we define for the jth sample the relative error , where we use the same notations as in equation (5).Remark from the definition that positive and negative errors are mapped together.We collect all errors computed with the expression above for this data set, and further group them based on their magnitudes.The errors are assumed to be outcomes from a Gaussian mixture with two components: the first group is denoted R g and contains the residuals with small magnitudes that correspond to the case of good fit between the measured data and the sum-of-exponentials model.The rest of the residuals are assigned to the group conventionally denoted by R b .Since we label each error with R g or R b , it follows that the larger the number of R g labels for a gene record, the higher the quality of fit for the gene.To fix the ideas we note that the data set we study contains records for 1412 genes and six measurements are available for each gene.For simplicity we ignore one gene for which all six measurements take the same value.Therefore the total number of errors is 8466.After applying the CEM algorithm initialized from 20 different points and selecting the best solution, the errors are split in the groups R g and R b .We note that R g contains 6164 errors for which the mean is 0.054 and the variance is 0.003.The rest of 2302 errors are grouped in R b , their mean is 0.390 and the variance is 0.045.
For m ∈ {1, . . ., 6}, we assign a gene to the set denoted R m g if at least m of its error labels are R g .For each m, the genes not assigned to R m g are included in R m b .This leads to a goodness-of-fit criterion: for a given value of m, accept that the sum-of-exponentials model fits well a gene if the gene belongs to R m g .For example, when choosing m = 6, the criterion is very selective in the sense that requires all Table 8: The goodness-of-fit of sum-of-exponentials model is evaluated with two different criteria for postnatal development of mouse cerebellum gene expression data.The first criterion compares with a threshold Th RSS the ratio between the sum of squared errors and the sum of squared measurements, for each gene.For the second criterion, the small magnitude errors are collected in R g , and the rest of errors are included in R b .Then R m g , 1 ≤ m ≤ 6, is the set of all genes for which at least m errors belong to R g .For each m, the complementary set of R m g is R m b .The contingency tables of the gene partitions determined by the two criteria are shown for different values of Th RSS and for m ∈ {4, 5, 6}.errors corresponding to the analyzed gene to be small in magnitude.Since we are interested only in those genes for which the number of small magnitude errors exceeds the number of large magnitude errors, we restrict the analysis to m ∈ {4, 5, 6}, and note that the cardinalities of the selected sets are . We can admit with high degree of confidence that a particular gene is properly selected with the criterion (11) if the gene is also included in the set R 6 g .In general, choosing a particular value for Th RSS leads to a partition of the genes into two different groups.Similarly, selecting the value of m leads to another two-groups partition of the gene set.The contingency tables, shown in Table 8, are very convenient for comparing the partitions obtained with various values of Th RSS and m.Recall that R 6 g set contains all genes for which the magnitudes of all errors are small.According to the results in Table 8, choosing Th RSS to be 0.01 or 0.05 implies that some genes from R 6 g are considered to have poor fit with sum-ofexponentials model.For Th RSS = 0.05, 58 genes from R 6 g are deemed as poor fit for sum-of-exponentials model, while 17 genes from R 6 b are assumed to fit well the model.When the decision is based on a threshold value Th RSS ≥ 0.1, all 378 genes in the set R 6 g are qualified as well fitted by the model.Remark from the last column in Table 8 that for Th RSS ≤ 0.1 none of the genes with less than four errors in R g are selected by the criterion (11).These properties recommend to choose Th RSS = 0.1.To illustrate the accuracy of modelling gene expressions with sum of exponentials, we plot in Figure 1 the original measurements and the optimal model for two genes.
Based on criterion (11) with Th RSS = 0.1, we select 489 out of 1412 genes of cerebellum data set, and further cluster them.We apply the same procedure for the gene expressions measured during postnatal development of mouse dentate gyrus, and the number of genes for which the model with sum of exponentials fits well is 561 out of 1412.

Clustering the selected genes
We use CEM to group into N TC = 3 clusters the time constants associated with the 489 genes selected from cerebellum data set.After running the algorithm from 40 different initialization points, the resulting centroids are T 1 = 1.88,T 2 = 6.16,T 3 = 15.41.We note that the time constants are clustered after eliminating the outliers which are the values smaller than 1, or larger than 35.Based on the method described in Section 3, we use the clusters already found for time constants to group the genes, and the resulting number of gene clusters is N GC = 6.Recall that two genes are pooled together in the same cluster when both sets of time constants are associated to the same sequence of centroids.We assign to the gene clusters labels from C 1 to C 6 , and list for every cluster the corresponding sequence of centroids: Observe that there is no gene for which the set of time constants contains representatives from all clusters whose centroids are T 1 , T 2 , T 3 .Then we focus on the time constants corresponding to the 561 genes selected from mouse dentate gyrus data set.After dropping the outliers, the time constants are grouped by CEM in three clusters whose centroids are T 1 = 1.95,T 2 = 5.58, and T 3 = 19.38.Remark that the centroids are close to those determined for the cerebellum data.Moreover, the number of gene clusters is also six, and the sequence of centroids corresponding to every gene cluster is the same as for the cerebellum data.

Enrichment of functional categories
We briefly revisit some statistical aspects regarding the enrichment of functional categories in experiments with microarray data [33].To fix the ideas, we assume that the total number of genes on the microarray is M, and only a proportion q of them belongs to functional category F .Therefore, qM genes are in category F , and (1 − q)M genes are not in this category, where 0 ≤ q ≤ 1. Suppose that the number of genes in cluster C is m.We randomly choose m out of M genes, and denote by V the random variable which counts how many genes from F are among the m selected genes.The probability function of V is modelled by the hypergeometric distribution H(M, m; q) [34].If v(C, F ) denotes the number of genes that belong both to cluster C and to category F , then F is enriched in C when v(C, F ) is such that Prob{V > v(C, F )} < 0.05, where Since M takes large values for microarray data, numerical difficulties occur when computing the expression above.Relying on the theoretical result [34], which claims that the limit of the hypergeometric distribution H(M, m; q) as M → ∞ is the binomial distribution Bi(m, q), the probability function of V is modelled by Bi(m, q), and ( 12) will be replaced by We use next both (12) and ( 13) to investigate the enrichment of functional categories in the examples from the developmental biology.We report in Tables 9, 10, 11, and 12 the results obtained when testing the enrichment of functional categories in clusters found as described above.The following supplementary conditions are added to the tests in (12) and ( 13): to decide that the functional category F is enriched in cluster C, it is necessary that at least 10 genes on the microarray belong to F , and at least 2 genes from C belong to F .To refer to various functions, we use in Tables 9, 10, 11, and 12 the acronyms defined at http://physiolgenomics.physiology.org/cgi/content/full/4/2/155/DC1/2.
Comparing the results in Tables 9 and 10, we remark that almost the same functions are found to be enriched by statistical tests (12) and (13).Applying the binomial distribution model seems to be more restrictive in the sense that the functions C and CY are reported as enriched in Table 9, but not in Table 10.
According to the results shown in Tables 11 and 12, some functions are enriched in the clusters of gene expressions for dentate gyrus data.Observe for this data set that exactly the same functions are found to be enriched when the test is performed with (12), and with (13).Moreover, there exist some functional categories enriched both in clusters of cerebellum data and in clusters of dentate gyrus data: STK, SY, B, PS, and GANN.
The very last observation has a special importance in comparison with the results reported in [1,3] where enriched functions have been found only for the clusters in cerebellum data.Our clustering procedure allows to find enriched functions for both cerebellum data and dentate gyrus data, and some of these functions are the same for the two data sets.It is widely accepted in biology that cerebellar cortex and the dentate gyrus have common features [3], therefore the findings of the newly introduced algorithm can be explained based on biological knowledge.The biological significance of our results still remains to be further investigated in the future.

CONCLUSION
In this paper, we propose a new approach for clustering the gene expression data that are time series.The key step of the algorithm consists in fitting a sum of exponentials to the nonuniformly sampled points of every time series.The optimal number of exponentials is inferred relying on a new information theoretic criterion which is defined based on NML estimator [28,29].The estimation method is tested with carefully crafted data, and the results are compared with Cramér-Rao lower bound.The conclusions drawn for simulated data to define a criterion that determines when the sum of exponentials model fits well the measured data.The clustering procedure is applied for data sets from developmental biology [1,3], and the enrichment of functional categories is investigated.

APPENDICES A. THE SEPARABILITY OF PARAMETERS
FOR THE LS PROBLEM Nonlinear LS problems are in general difficult to solve, and it is of interest to reduce the problem to one of a smaller dimensionality if the optimization can be analytically performed over a subset of variables, for fixed values of the remaining variables.This can be readily achieved for model (2) by restating the LS problem as follows.
Minimize ε 2 , where ε is the error vector ε= z − B(b)a and where b is the vector [β 1 , . . ., β p ] , a is the vector [α 1 , . . ., α p ] , and B is the matrix with entries b i j = exp(−t i β j ), for 1 ≤ i ≤ n and 1 ≤ j ≤ p. Recall that the vector of measurements is z = [z(t 1 ) • • • z(t n )] .Assume that n ≥ p, which is the case of interest where the number of exponentials does not exceed the number of measurements.
For a given b, we denote by â(b) the vector that minimizes ε 2 , which results to be the linear LS solution

B. COMPUTATION OF CRAM ÉR-RAO LOWER BOUND (CRLB)
Since the samples are statistically independent, we obtain the following expression of the log-likelihood function, under the hypothesis of white Gaussian noise with zero mean and variance σ 2 : where z is the vector of measurements, and the set of parameters θ γ is given in (6).In this section, we drop the index γ, and write θ instead of θ γ .The expression of the Fisher information matrix is We focus on the entries of the block J: where the derivatives are evaluated at the true value of θ , and the expectation is taken with respect to the probability density function of z conditional to the model parameters [23].θ and θ m are the 'th and the m'th entries of θ , where 1 ≤ , m ≤ 2p.A well-known result on CRLB for general Gaussian case [23] implies  When the set of parameters for the signal model is θ γ (7), let G be the block of the Fisher information matrix similar to J .For simplicity, we use the notation θ instead of θ γ .Based on the result from [23] on vector parameter transformations, the relationship between the entries of the matrices G and J is given by each point b in the grid, the linear parameters â(b) are fitted as shown in Appendix A by minimizing the sum of residual squares.The starting point is chosen to be the pair (â(b), b) that minimizes the sum of residual squares over all points of the grid defined in the space of b parameters.The selected pair (â(b),
â(b) = (B B) −1 B z, if the matrix B B is nonsingular.The matrix B has all columns independent, since every p × p minor is a generalized Vandermonde matrix with nonzero determinant.Thus, the matrix B B is positive definite and nonsingular, and â(b) = (B B) −1 B z is the unique minimizer of ε 2 for a given b vector.Evaluating ε 2 at the value â(b), we find ε 2 = z (I − B(B B) −1 B )z, which is now a nonlinear criterion in the reduced set of parameters b, much easier to solve when compared to the original problem having as parameters the entries of the vectors a and b.
. It is clear that Input: a data set containing gene expressions that are time series.It is not necessary that the time sampling points are the same for all genes, and the number of samples can vary from one gene to another. 1.For every gene, the available measurements are denoted by y(t 1 ), . . ., y(t n ), where the time points t 1 , . . .t n are generally nonequidistant.For p = 1 : p max , fit a sum of p exponentials to the data y(t 1 ), . . ., y(t n ) (Section 2.3); evaluate the nMDL criterion (10).End Choose p * to be the value of p that minimizes the nMDL criterion.The estimated parameters are the gains α1 , . . ., αp * , and the time constants are τ1 , . . ., τp * .If the goodness-of-fit criterion (11) is satisfied, include the time constants τ1 , . . ., τp * in the set T .Else Label the current gene with zero.End End 2. After eliminating the outliers, group the objects from T into N TC clusters by applying the CEM algorithm (Section 3.1).

Table 2 :
Parameters, Cramér-Rao lower bounds, and estimation results for 50 trials when ten different gene prototypes are considered.Noise standard deviation is σ = 10 −3 .

Table 3 :
The percentage of estimations for model order when applying MDL and nMDL criteria.The value σ of noise standard deviation is given for every experiment.The symbol * is used to indicate the "true" model order.

Table 4 :
Results obtained when clustering the time constants estimated for 25 microarray copies when each copy contains the measurements of exactly 2 genes from every prototype G 1 − G 10 .The values represented with shades are counts for the time constants that are properly assigned to clusters, and the numbers represented without shades count for misclassified time constants.The structure parameter is estimated by applying the nMDL criterion.

Table 5 :
Counting the well-classified and misclassified time constants when the data sets are the same as for the results reported in Table4, and the structure parameter is estimated with MDL criterion.The convention for using shades is the same as in Table4.

Table 6 :
The centroids for clusters of estimated time constants.For the scenario "one clustering for each copy," the mean and standard deviation are reported for every centroid.The structure parameter is estimated with the nMDL criterion.

Table 7 :
The centroids for clusters of estimated time constants.For the scenario "one clustering for each copy", the mean and standard deviation are reported for every centroid.The structure parameter is estimated with the MDL criterion.

Table 9 :
Enrichment of functions in clusters of gene expressions measured during mouse cerebellar development.For each cluster C and for each functional category F , the number of genes v(C, F ) that belong both to C and F is given.The enrichment is marked with shades.Statistical tests are based on hypergeometric distribution.The following acronyms are used for the functional cate-

Table 10 :
Enrichment of functions in clusters of gene expressions measured during mouse cerebellar development.Statistical tests for enrichment are based on binomial distribution.The acronyms for the functional categories are the same as in Table9.

Table 11 :
Enrichment of functions in clusters of gene expressions measured during mouse dentate gyrus development.For each clus- ter C and for each functional category F , the number of genes v(C, F ) that belong both to C and F is given.The enrichment is marked with shades.Statistical tests are based on hypergeometric distribution.The following acronyms are used for the functional categories: STK: serine/threonine kinase, TES: testis, SY: synaptic component, CR: chromosome component, E: electron transfer, SEC: secretory pathway, B: brain and neuron, L: lipid metabolism, GANN: oncogenes and their relates, PS: ribosomal proteins, A: amino acid metabolism, CHAP: chaperonines.

Table 12 :
Enrichment of functions in clusters of gene expressions measured during mouse dentate gyrus development.Statistical tests for enrichment are based on binomial distribution.The acronyms for the functional categories are the same as in Table 11.