DNA Microarray Data Analysis: A Novel Biclustering Algorithm Approach

Biclustering algorithms refer to a distinct class of clustering algorithms that perform simultaneous row-column clustering. Biclustering problems arise in DNA microarray data analysis, collaborative filtering, market research, information retrieval, text mining, electoral trends, exchange analysis, and so forth. When dealing with DNA microarray experimental data, for example, the goal of biclustering algorithms is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this study, we develop novel biclustering algorithms using basic linear algebra and arithmetic tools. The proposed biclustering algorithms can be used to search for all biclusters with constant values, biclusters with constant values on rows, biclusters with constant values on columns, and biclusters with coherent values from a set of data in a timely manner and without solving any optimization problem. We also show how one of the proposed biclustering algorithms can be adapted to identify biclusters with coherent evolution. The algorithms developed in this study discover all valid biclusters of each type, while almost all previous biclustering approaches will miss some.


INTRODUCTION
One of the major goals of gene expression data analysis is to uncover genetic pathways, that is, chains of genetic interactions. For example, a researcher may be interested in identifying the genes that contribute to a disease. This task is difficult because subgroups of genes display similar activation patterns only under certain experimental conditions. Genes that are coregulated or coexpressed under a subset of conditions will behave differently under other conditions. Finding genetic pathways may therefore benefit from identifying clusters of genes that are coexpressed under subsets of conditions as opposed to all conditions.
Gene expression data is typically arranged in a data matrix, with rows corresponding to genes and columns corresponding to experimental conditions. Conditions can be different environmental conditions or different time points corresponding to one or more environmental conditions. The (n, m)th entry of the gene expression matrix represents the expression level of the gene corresponding to row n under the specific condition corresponding to column m. The numerical value of the entry is usually the logarithm of the relative amount of the mRNA of the gene under the specific condition. By simultaneously clustering the rows and columns of the gene expression matrix, one can identify candidate subsets of conditions that may be associated with cellular processes that manifest themselves only under those conditions, or identify subsets of genes that potentially play a role in a given biological process. Biological analysis and experimentation could then confirm the biological significance of the candidate subsets.
Biclustering was first described in the literature by Hartigan [1]. It refers to a distinct class of clustering algorithms that perform simultaneous row-column clustering. Biclustering problems arise in microarray data analysis, collaborative filtering, market research, information retrieval, text mining, electoral trends, exchange analysis, and so forth. Cheng and Church were the first to apply biclustering to analyze DNA microarray experimental data [2]. They introduced the term biclustering to denote simultaneous row-column clustering of gene expression data. Biclustering algorithms are also known as bidimensional clustering, subspace clustering, and coclustering in other application fields. It should be clear that biclustering techniques produce local models, whereas clustering approaches compute global models. If we use a clustering algorithm on the rows of the gene expression matrix, a given gene cluster is defined using all the conditions. In contrast, a biclustering technique will assign a gene to a bicluster based on a subset of conditions. Furthermore, when a clustering algorithm is applied to the rows of the gene expression matrix, it assigns each gene to a single cluster. Biclustering techniques, on the other hand, identify clusters that are neither mutually exclusive nor exhaustive. A gene may belong to no cluster, to one cluster, or to several clusters.
Cheng and Church compute the residue of each element of a submatrix of the gene expression matrix by subtracting from that element the means of all elements in its corresponding row and column of the submatrix and by adding the overall mean of all elements in the submatrix. They define a bicluster to be a submatrix formed with a subset of rows and columns of the gene expression matrix with a low mean-squared residue score, and they use a greedy approach to find biclusters. After that, many other approaches were proposed in the literature [3-9]. For example, Tanay et al. [3] mapped expression data onto bipartite graphs and used probabilistic graph techniques to find biclusters. Getz et al. [4] devised a coupled two-way iterative clustering algorithm to identify biclusters. Lazzeroni and Owen [5] introduced the notion of a plaid model, which describes the input matrix as a linear function of variables corresponding to its biclusters. Ben-Dor et al. [6] defined a bicluster as an order-preserving submatrix, or equivalently, a group of genes whose expression levels induce some linear order across a subset of the conditions. Yang et al. [9] used tree traversal with two-way pruning of maximum coherent sets for each pair of genes and each pair of conditions. See [10] for many other approaches.
Most of these previous techniques search for one or two types of biclusters among four that have been identified in the literature [10]: biclusters with constant values, biclusters with constant values on rows or columns, biclusters with coherent values, and biclusters with coherent evolution. Most previous techniques are also greedy and will miss meaningful biclusters. Many of these pioneering approaches used a cost function to define biclusters. In many cases, the cost function will measure the square deviation from the sum of the mean value of expression levels in the entire bicluster, and the mean values of expression levels along each row and column in the bicluster.
Our objective here is to develop a biclustering algorithm that is able to discover, in a timely manner, all biclusters of any user-defined type in a given data set. The proposed biclustering algorithm approach is different from previous ones in several ways. Firstly, the proposed approach can be used to find the exact number of all valid perfect biclusters of each type and identify all of them in a timely manner. Secondly, the proposed approach uses basic linear algebra and arithmetic tools and avoids the need for the heuristic cost functions of prior approaches, which can miss some pertinent biclusters. More specifically, our approach relies on the manipulation of elementary binary matrices with entries equal to "0" or "1." Finally, our approach allows the user to view biclusters under any specific experimental condition.
Observe also that our procedures will produce more biclusters than most of the other biclustering approaches since they identify all biclusters of a given type. As mentioned above, this reduces the probability of missing a bicluster of potentially significant biological value. On the other hand, this also increases the number of biclusters that a biologist needs to further examine. So far, we have not identified an effective criterion for ranking biclusters according to their potential biological significance.
The rest of this paper is organized as follows. After a quick description of the gene expression matrix in Section 2, we develop the proposed biclustering algorithm in Section 3. In Section 4, we show some simulation results and we compare the proposed biclustering algorithm with previous ones.

GENE EXPRESSION MATRIX
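DNA microarray data can be represented as an $N \times M$ matrix $A$ whose rows represent the genes, whose columns represent the experimental conditions, and whose real-valued entries $a_{nm}$ represent the expression level of gene $n$ under condition $m$, as illustrated in (1):

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{bmatrix}. \tag{1}$$

We can also partition the matrix $A$ into rows or into columns, as illustrated by (2):

$$A = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_N \end{bmatrix} = \begin{bmatrix} C_1 & C_2 & \cdots & C_M \end{bmatrix}. \tag{2}$$

In (2), $R_n = [a_{n1}\ a_{n2}\ \cdots\ a_{nM}]$ and $C_m = [a_{1m}\ a_{2m}\ \cdots\ a_{Nm}]^{T}$, where $1 \le n \le N$ and $1 \le m \le M$. The row vector $R_n$ corresponds to the expression levels of the $n$th gene under the $M$ conditions. The column vector $C_m$ corresponds to the expression levels of the $N$ genes under the $m$th condition. From (1), we can also define two additional vectors: the row vector Conditions ($1 \times M$) and the column vector Genes ($N \times 1$). They are both label vectors, defined to keep track of every condition and gene.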

THE PROPOSED BICLUSTERING ALGORITHM
Our proposed biclustering algorithm works as follows. After handling missing values and noise corruption, using either any of the known techniques or the simple approach that we describe below, the gene expression matrix is written as the sum of the products of each of its distinct elements with an elementary matrix. Each elementary matrix is binary, that is, its elements are either "1" or "0." By performing elementary row or column operations on the elementary matrices, it becomes easy to identify all perfect biclusters in a timely manner.

Data conditioning
The first part of the proposed biclustering algorithm consists of data conditioning, which is needed because DNA microarray experimental data is noisy and also contains missing values.
Many techniques to recover missing values have been developed in the literature, for example, [11, 12]. Since the recovery of missing values is not our main focus in this study, we have used the zero method, that is, replacing each missing value by zero.
Several techniques have been proposed in the literature to deal with noise, including many data quantization techniques. In this study, we have used the following approach. First, we identify the number $L$ of distinct values $\alpha_l$ that exist in the gene expression matrix $A$. We assume that the values $\alpha_l$ are rank-ordered according to their magnitudes, that is, $\alpha_l < \alpha_{l+1}$. Next, we redefine $\alpha_l$ using

$$\alpha_l = \frac{b_{l-1} + b_l}{2}, \qquad 1 \le l \le L,$$

where

$$b_l = b_0 + l\,e, \qquad e = \frac{b_L - b_0}{L}.$$

The interval $[b_0, b_L]$ is then divided into $L$ equal intervals:

$$[b_0, b_1[,\ [b_1, b_2[,\ \ldots,\ [b_{L-1}, b_L].$$

Finally, a new data matrix is obtained by quantizing each expression value $a_{nm}$ using Algorithm 1. Specifically, if $a_{nm}$ falls in the interval $[b_{l-1}, b_l[$, then it is quantized to the centroid $\alpha_l$ of that interval. One advantage of using this quantization approach is that it operates on all the data of the matrix. Therefore the biclusters that are present in the original set of data are not likely to be destroyed. All it does is reduce the number of original biclusters and increase their size by merging some of them together. This happens because this first global manipulation reduces the effect of noise in the entries of the gene expression matrix, so that the set of data becomes more uniform. We have also found this quantization approach to be useful in extending our basic biclustering approaches to deal with the coherent evolution case, as we will explain below.

Algorithm 1: Data quantization procedure (input: A = microarray data).
Note that one can also choose to perform the same manipulation described above gene by gene, that is, by performing the same manipulation on each row of the gene expression matrix separately. One can also use any other quantization method, such as [13].
Finally, note that it is important in practice to assess the effects of the quantization step on the biclusters that are identified by the procedures that we discuss below. This can be done by performing a simple sensitivity analysis in which the parameter e is perturbed about its selected value. It is enough to consider one or two values for e below and above its selected numerical value as determined above. Only biclusters that continue to be identified by the algorithms as e is varied should be retained for further examination. Note that the number of genes in these biclusters may also change. The user therefore needs to determine a rule for dealing with genes that may be dropped from the biclusters as e changes. The most conservative approach would be to retain only the genes that remain in the biclusters for all values of e around its selected value.
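The data conditioning step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Algorithm 1: the function name `quantize`, the use of `None` to mark a missing value, and the choice to pass b_0, b_L, and L in as arguments are all assumptions made here, and the centroid is taken to be the midpoint of each interval.

```python
import math


def quantize(A, b0, bL, L):
    """Quantize every entry of A to the centroid of its interval.

    Assumptions (not from the paper): missing values are marked with None
    and replaced by zero first; the bounds b0, bL and the number of
    intervals L are supplied by the caller.
    """
    e = (bL - b0) / L  # common interval width

    def quantize_value(a):
        if a is None:  # zero method for missing values
            a = 0.0
        # Interval index l in 1..L such that b_{l-1} <= a < b_l;
        # values on or beyond the bounds are clamped to the end intervals.
        l = int(math.floor((a - b0) / e)) + 1
        l = max(1, min(l, L))
        return b0 + (l - 0.5) * e  # centroid of [b_{l-1}, b_l[

    return [[quantize_value(a) for a in row] for row in A]


# Two intervals of width 5 on [0, 10]: centroids are 2.5 and 7.5.
quantized = quantize([[0, 4, 5, 10], [None, 9, 1, 6]], 0, 10, 2)
```

Rerunning such a function with slightly different interval widths is one simple way to carry out the sensitivity analysis on e described above.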

Gene expression matrix decomposition
The second part of the proposed biclustering algorithm consists of writing matrix $A$ as the sum of the products of each of its distinct elements with a corresponding elementary matrix. It is the first important step of the proposed biclustering algorithm because after the gene expression matrix is written as mentioned above, obtaining perfect biclusters is straightforward. This is due to the fact that the elementary matrices consist of "0's" and "1's." Given that $A$ is made up of $L$ distinct values $\alpha_l$, $A$ can be expressed using

$$A = \sum_{l=1}^{L} \alpha_l A_l. \tag{8}$$

From (8), we observe that the $A_l$'s are binary matrices as mentioned earlier: the $(n, m)$th entry of $A_l$ is 1 if $a_{nm} = \alpha_l$ and 0 otherwise. We can also partition the matrices $A_l$ as rows or columns as illustrated by (9) and (10), respectively:

$$A_l = \begin{bmatrix} r^l_1 \\ r^l_2 \\ \vdots \\ r^l_N \end{bmatrix}, \tag{9}$$

$$A_l = \begin{bmatrix} c^l_1 & c^l_2 & \cdots & c^l_M \end{bmatrix}. \tag{10}$$

In (9) and (10), respectively, the row vectors $r^l_n$ are binary $1 \times M$ vectors and the column vectors $c^l_m$ are binary $N \times 1$ vectors. The row vector $r^l_n$ corresponds to the $n$th row of the elementary matrix that is associated with the $l$th distinct element of the gene expression matrix. The column vector $c^l_m$ corresponds to the $m$th column of the elementary matrix that is associated with the $l$th distinct element of the gene expression matrix. From (2)-(10), we can derive the following relations:

$$R_n = \sum_{l=1}^{L} \alpha_l\, r^l_n, \qquad C_m = \sum_{l=1}^{L} \alpha_l\, c^l_m,$$

where

$$\sum_{l=1}^{L} r^l_n = \mathrm{ones}(1, M), \qquad \sum_{l=1}^{L} c^l_m = \mathrm{ones}(N, 1).$$

Here, ones(K, L) denotes a $K \times L$ matrix of ones. Finally, note that since we are dealing with binary numbers, the number of distinct nonzero combinations that the row vector $r^l_n$ can take is less than or equal to $2^M - 1$ and the number of distinct nonzero combinations that the column vector $c^l_m$ can take is less than or equal to $2^N - 1$.
Decomposing the gene expression matrix as shown above has many advantages. Firstly, as mentioned earlier, all subsequent algorithms operate on binary data. Thus we gain in terms of computational complexity and memory resources. Secondly, it allows the user to get more local information about the gene expression matrix in a simple way. For example, the ones in the binary row vector $r^l_n$ show the positions (i.e., the conditions) at which the $n$th gene has the expression value $\alpha_l$ (which corresponds to the $l$th distinct element of the gene expression matrix), and its zeros show the positions at which the same $n$th gene is not expressed at $\alpha_l$. On the other hand, the ones in the binary column vector $c^l_m$ show the subgroup of genes that have the same expression value $\alpha_l$ under the $m$th condition, and its zeros show the subgroup of genes that are not expressed at the value $\alpha_l$ under the same $m$th condition. Also, if one is given two genes with two different binary row vectors $r^l_n$ and $r^l_k$ associated with the same expression value $\alpha_l$, one can identify the positions at which both genes are expressed simultaneously at $\alpha_l$ by computing the elementwise product of $r^l_n$ and $r^l_k$. The result will be a binary row vector with its ones showing the positions at which both genes are expressed simultaneously at $\alpha_l$. As will become clear below, this observation plays a critical role in the elaboration of the proposed biclustering algorithm. Finally, observe that the decomposition is also a powerful gene expression visualization tool.
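The decomposition and the elementwise-product observation can be illustrated with a minimal Python sketch. The helper names `decompose` and `elementwise_product` and the small matrix are hypothetical and introduced only for illustration; the sketch assumes the data has already been quantized so that equal values compare exactly.

```python
def decompose(A):
    """Map each distinct value alpha_l of A to its binary elementary matrix A_l."""
    values = sorted({x for row in A for x in row})
    return {v: [[1 if x == v else 0 for x in row] for row in A] for v in values}


def elementwise_product(u, v):
    """Elementwise product of two equal-length binary vectors."""
    return [a * b for a, b in zip(u, v)]


# Toy expression matrix: 2 genes (rows) by 3 conditions (columns).
A = [[2, 2, 3],
     [2, 3, 3]]
elementary = decompose(A)

# Summing alpha_l * A_l over all distinct values recovers A.
rebuilt = [[sum(v * E[n][m] for v, E in elementary.items())
            for m in range(len(A[0]))]
           for n in range(len(A))]

# Conditions at which genes 0 and 1 are both expressed at the value 2.
shared = elementwise_product(elementary[2][0], elementary[2][1])
```

Here `shared` has a one only at the first condition, which is the single position where both genes take the value 2.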

Biclusters identification
The third part of the proposed algorithm consists of identifying the four types of biclusters from the gene expression matrix. Firstly, we develop three simple algorithms that can be used to extract all biclusters with constant values, biclusters with constant values on columns, and biclusters with constant values on rows. Secondly, we show how one of these algorithms can be modified to extract biclusters with coherent values. Finally, we describe how the modified algorithm, when coupled with the tuning parameter $e = (b_L - b_0)/L$ defined above, can predict biclusters with coherent evolution from a set of data.

Biclusters with constant values
In DNA microarray experimental data, a perfect bicluster with constant values is any submatrix $B = [a_{ij}]$ of $A$ with dimension $I \times J$ whose elements are constant:

$$a_{ij} = \mu,$$

where $1 \le i \le I$ and $1 \le j \le J$. Such matrices reveal subgroups of genes with constant expression levels within a subgroup of conditions or vice versa.
From the gene expression matrix decomposition performed above, such matrices can be obtained by analyzing each elementary matrix $A_l$ separately to obtain subgroups of genes that have constant expression level $\alpha_l$ under different conditions. Such matrices will therefore correspond to submatrices of each elementary matrix whose elements are all equal to the binary number "1." To identify such matrices, we proceed by identifying the set of distinct nonzero rows of each elementary matrix. The sum of the cardinalities of these sets over all elementary matrices $A_l$ is equal to the exact number of biclusters with constant values that can be found in a set of data.
In other words, since $A_l$ is a binary matrix, and since the number of genes $N$ is always greater than the number of conditions $M$, the number of biclusters ($N_b$) with constant values in DNA microarray experimental data can be defined using

$$N_b = \sum_{l=1}^{L} P_l,$$

where $P_l$ is the number of distinct nonzero rows $r^l_i$ of each elementary matrix $A_l$. Now note that each distinct nonzero row $r^l_i$ of each elementary matrix $A_l$ constitutes the principal row element of the $i$th bicluster $B^l_i$ of the elementary matrix $A_l$ considered. Therefore, in order for any other row $r^l_n$ of the elementary matrix $A_l$ to belong to the $i$th bicluster, (15) has to be true:

$$r^l_i \mathbin{.*} r^l_n = r^l_i, \tag{15}$$

where $1 \le i \le P_l$, $1 \le n \le N$, $1 \le l \le L$, and ".*" denotes the elementwise product of the two given row vectors. Algorithm 2 is then used to extract biclusters that have constant expression level $\alpha_l$.

Algorithm 2: Algorithm for finding biclusters with constant values (input: A = quantized microarray data).
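The procedure just described can be sketched as follows. This is a hedged illustration rather than the authors' Algorithm 2: it assumes that the membership test (15) requires the elementwise product of the principal row with a candidate row to return the principal row, and the names `decompose` and `constant_biclusters` are hypothetical.

```python
def decompose(A):
    """Map each distinct value of A to its binary elementary matrix."""
    values = sorted({x for row in A for x in row})
    return {v: [[1 if x == v else 0 for x in row] for row in A] for v in values}


def constant_biclusters(A):
    """Return (value, gene indices, condition indices) for each principal row."""
    biclusters = []
    for value, E in decompose(A).items():
        # Distinct nonzero rows of this elementary matrix (the principal rows).
        principal_rows = sorted({tuple(r) for r in E if any(r)})
        for p in principal_rows:
            cols = [m for m, bit in enumerate(p) if bit]
            # Assumed reading of (15): row n belongs if p .* r_n == p.
            rows = [n for n, r in enumerate(E)
                    if tuple(a * b for a, b in zip(p, r)) == p]
            biclusters.append((value, rows, cols))
    return biclusters


A = [[5, 5, 1],
     [5, 5, 2],
     [3, 5, 5]]
found = constant_biclusters(A)
```

For this toy matrix, the value 5 has two distinct nonzero rows, so it contributes two biclusters: genes 0 and 1 under conditions 0 and 1, and gene 2 under conditions 1 and 2. The number of biclusters returned equals the sum of the P_l, as stated above.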

Biclusters with constant values on columns
A perfect bicluster with constant values on columns is any submatrix $B = [a_{ij}]$ of $A$ with dimension $I \times J$ which has one of the following forms:

$$a_{ij} = \mu + \beta_j \quad \text{(additive model)}, \qquad a_{ij} = \mu \times \beta_j \quad \text{(multiplicative model)},$$

where $1 \le i \le I$ and $1 \le j \le J$. We observe that if $\beta_j = 0$ in the additive model or $\beta_j = 1$ in the multiplicative model, we have $a_{ij} = \mu$. Thus perfect biclusters with constant values form a subclass of biclusters with constant values on columns.
In DNA microarray experimental data, biclusters with constant values on columns identify subgroups of conditions within which a subgroup of genes presents similar expression values, where the expression values may differ from condition to condition.
Unlike Algorithm 2, which dealt with the elementary matrices $A_l$ one at a time, identification of biclusters with constant values on columns must examine all elementary matrices at the same time. It proceeds by identifying the exact number of distinct columns of all the elementary matrices taken together. The number found corresponds to the exact number of biclusters with constant values on columns that can be found in a set of data. Each distinct column also defines the membership in a bicluster as shown below. From the gene expression matrix decomposition performed above, the number of biclusters ($N_b$) with constant values on columns is given by

$$N_b = P_c,$$

where $P_c$ is the number of distinct nonzero columns $c_j$ of all the elementary matrices $A_l$ taken together. Once more, each distinct column $c_j$ constitutes the principal column element of the $j$th bicluster $B_j$. Therefore, in order for any other column $c^l_m$ of any elementary matrix $A_l$ to belong to the $j$th bicluster, (19) has to be verified:

$$c_j \mathbin{.*} c^l_m = c_j, \tag{19}$$

where $1 \le j \le P_c$, $1 \le m \le M$, and $1 \le l \le L$. Algorithm 3 is then used to extract biclusters that have constant values on columns.
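A minimal Python sketch of this column-wise search is given below. It is an illustration under stated assumptions, not the authors' Algorithm 3: condition (19) is assumed to be the column analogue of the elementwise-product test, and the names `decompose` and `constant_column_biclusters` are hypothetical.

```python
def decompose(A):
    """Map each distinct value of A to its binary elementary matrix."""
    values = sorted({x for row in A for x in row})
    return {v: [[1 if x == v else 0 for x in row] for row in A] for v in values}


def constant_column_biclusters(A):
    """Return (gene indices, condition indices) for each principal column."""
    n_cols = len(A[0])
    # Columns of all elementary matrices, tagged with their condition index m.
    all_columns = [(m, tuple(row[m] for row in E))
                   for E in decompose(A).values() for m in range(n_cols)]
    principal = sorted({c for _, c in all_columns if any(c)})
    biclusters = []
    for p in principal:
        rows = [n for n, bit in enumerate(p) if bit]
        # Assumed reading of (19): column c belongs if p .* c == p.
        cols = sorted({m for m, c in all_columns
                       if tuple(a * b for a, b in zip(p, c)) == p})
        biclusters.append((rows, cols))
    return biclusters


A = [[1, 7, 4],
     [1, 7, 9],
     [2, 7, 4]]
found = constant_column_biclusters(A)
```

In this toy example, genes 0 and 1 under conditions 0 and 1 give the submatrix [[1, 7], [1, 7]], and genes 0 and 2 under conditions 1 and 2 give [[7, 4], [7, 4]]; both are constant on columns.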

Biclusters with constant values on rows
A perfect bicluster with constant values on rows is any submatrix $B = [a_{ij}]$ of $A$ with dimension $I \times J$ which has one of the following forms:

$$a_{ij} = \mu + \alpha_i \quad \text{(additive model)}, \qquad a_{ij} = \mu \times \alpha_i \quad \text{(multiplicative model)},$$

where $1 \le i \le I$ and $1 \le j \le J$. We observe that if $\alpha_i = 0$ in the additive model or $\alpha_i = 1$ in the multiplicative model, we have $a_{ij} = \mu$. In DNA microarray experimental data, biclusters with constant values on rows represent subgroups of genes with similar expression levels across a subgroup of conditions, allowing the expression levels to differ from gene to gene.
Identification of such biclusters uses the same methodology as in Algorithm 3. Algorithm 4 operates on the rows of all the elementary matrices at the same time. It proceeds by identifying the exact number of distinct rows of all the elementary matrices taken together. Once more, the number found corresponds to the exact number of biclusters with constant values on rows that can be found in a set of data. Each distinct row also defines the membership in a bicluster as shown below.
From the gene expression matrix decomposition performed above, the number of biclusters ($N_b$) with constant values on rows is given by

$$N_b = P_r,$$

where $P_r$ is the number of distinct nonzero rows $r_i$ of all the elementary matrices $A_l$ taken together. Each distinct row $r_i$ constitutes the principal row element of the $i$th bicluster $B_i$. Therefore, in order for any other row $r^l_n$ to belong to the $i$th bicluster, (23) has to be verified:

$$r_i \mathbin{.*} r^l_n = r_i, \tag{23}$$

where $1 \le i \le P_r$, $1 \le n \le N$, and $1 \le l \le L$. Algorithm 4 is then used to extract biclusters that have constant values on rows.
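The row-wise counterpart can be sketched in the same way. Again, this is only an illustration and not the authors' Algorithm 4: condition (23) is assumed to be the elementwise-product test applied to rows pooled from all elementary matrices, and the function names are hypothetical.

```python
def decompose(A):
    """Map each distinct value of A to its binary elementary matrix."""
    values = sorted({x for row in A for x in row})
    return {v: [[1 if x == v else 0 for x in row] for row in A] for v in values}


def constant_row_biclusters(A):
    """Return (gene indices, condition indices) for each principal row."""
    # Rows of all elementary matrices, tagged with their gene index n.
    all_rows = [(n, tuple(r))
                for E in decompose(A).values() for n, r in enumerate(E)]
    principal = sorted({r for _, r in all_rows if any(r)})
    biclusters = []
    for p in principal:
        cols = [m for m, bit in enumerate(p) if bit]
        # Assumed reading of (23): row r belongs if p .* r == p.
        rows = sorted({n for n, r in all_rows
                       if tuple(a * b for a, b in zip(p, r)) == p})
        biclusters.append((rows, cols))
    return biclusters


A = [[3, 3, 8],
     [5, 5, 5],
     [3, 1, 1]]
found = constant_row_biclusters(A)
```

Genes 0 and 1 under conditions 0 and 1 give [[3, 3], [5, 5]], and genes 1 and 2 under conditions 1 and 2 give [[5, 5], [1, 1]]; each is constant along its rows while the level differs from gene to gene.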

Biclusters with coherent values
A perfect bicluster with coherent values is any submatrix $B = [a_{ij}]$ of $A$ with dimension $I \times J$ which has one of the following forms:

$$a_{ij} = \mu + \alpha_i + \beta_j \quad \text{(additive model)}, \qquad a_{ij} = \mu \times \alpha_i \times \beta_j \quad \text{(multiplicative model)}.$$

In this study, we will only deal with the additive model. From the above definition, we observe that the types of biclusters defined previously are particular cases of biclusters with coherent values.
(i) If $\alpha_i = \beta_j = 0$, then $a_{ij} = \mu$ and the bicluster has constant values.
(ii) If $\alpha_i = 0$, then $a_{ij} = \mu + \beta_j$ and the bicluster has constant values on columns.
(iii) If $\beta_j = 0$, then $a_{ij} = \mu + \alpha_i$ and the bicluster has constant values on rows.
In a DNA microarray experimental data, biclusters with coherent values represent subgroups of genes and subgroups of conditions with coherent values on both rows and columns.
Note that a bicluster $B$ with coherent values can be viewed as the sum of three matrices: $B_1$ with constant values, $B_2$ with constant values on rows, and $B_3$ with constant values on columns, that is, $B = B_1 + B_2 + B_3$. Therefore, to obtain perfect biclusters with coherent values from DNA microarray experimental data, one of the following three approaches can be used.

Approach 1
The gene expression matrix $A$ is first written as the sum of three matrices $Z_1$, $Z_2$, and $Z_3$, where $Z_1$ is a matrix with constant values on rows, $Z_2$ a matrix with constant values on columns, and $Z_3 = A - (Z_1 + Z_2)$. Next, use Algorithm 2 to extract all perfect biclusters with constant values from $Z_3$. Finally, add to each entry of each of these biclusters the corresponding entry in $(Z_1 + Z_2)$ to obtain the biclusters with coherent values in $A$.

Approach 2
The gene expression matrix $A$ is first written as the sum of three matrices $Z_1$, $Z_2$, and $Z_3$, where $Z_1$ is a matrix with constant values, $Z_2$ a matrix with constant values on rows, and $Z_3 = A - (Z_1 + Z_2)$. Next, use Algorithm 3 to extract all perfect biclusters with constant values on columns from $Z_3$. Finally, add to each entry of each of these biclusters the corresponding entry in $(Z_1 + Z_2)$ to obtain the biclusters with coherent values in $A$.

Approach 3
The gene expression matrix $A$ is first written as the sum of three matrices $Z_1$, $Z_2$, and $Z_3$, where $Z_1$ is a matrix with constant values, $Z_2$ a matrix with constant values on columns, and $Z_3 = A - (Z_1 + Z_2)$. Next, use Algorithm 4 to extract all perfect biclusters with constant values on rows from $Z_3$. Finally, add to each entry of each of these biclusters the corresponding entry in $(Z_1 + Z_2)$ to obtain the biclusters with coherent values in $A$.
In this study, we use the third approach. The choice of the matrix $Z_1 + Z_2$, which has constant values on columns, is not arbitrary. It must be constructed using a row of the gene expression matrix $A$ that is also part of the bicluster with coherent values, as explained below.
Property 1. Let $X$ be a matrix that contains a bicluster with coherent values embedded within its structure. Subtract from $X$ a matrix $Y$ that has constant values on columns and is constructed using a row of $X$ that is also part of the bicluster with coherent values. The resulting matrix $Z$ contains a bicluster with constant values on rows embedded within its structure. Furthermore, the location of the bicluster with constant values on rows in $Z$ corresponds to that of the bicluster with coherent values in $X$.
Proof. Without loss of generality, consider a matrix $X$ that includes a bicluster with coherent values $B = (\alpha_i + \beta_j)$ embedded within its structure, and suppose that the first, third, and fourth rows of $X$ are part of $B$. We can then construct the matrix $Y$ that has constant values on columns using either the first, the third, or the fourth row of $X$. Let us use the first row of $X$, so that every row of $Y$ is equal to the first row of $X$. By computing $Z = X - Y$, each entry of $Z$ that lies inside the bicluster is equal to $(\alpha_i + \beta_j) - (\alpha_1 + \beta_j) = \alpha_i - \alpha_1$, which does not depend on the column index $j$. Hence $Z$ has a bicluster $B_c$ with constant values on rows embedded within its structure. Furthermore, the location of $B_c$ corresponds to that of the bicluster with coherent values in $X$. In [14], we provide a development of all of the other approaches.
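As a concrete illustration of Property 1, consider the following worked example. The $4 \times 4$ matrix below is hypothetical: its entries were chosen here for illustration and are not the numbers used in the original proof. Rows 1, 3, and 4 and columns 1 to 3 form a bicluster with coherent values, and $Y$ is built from the first row of $X$.

```latex
X = \begin{bmatrix}
  1 & 2 & 4 & 7 \\
  9 & 0 & 5 & 2 \\
  3 & 4 & 6 & 1 \\
  0 & 1 & 3 & 8
\end{bmatrix},
\qquad
Y = \begin{bmatrix}
  1 & 2 & 4 & 7 \\
  1 & 2 & 4 & 7 \\
  1 & 2 & 4 & 7 \\
  1 & 2 & 4 & 7
\end{bmatrix},
\qquad
Z = X - Y = \begin{bmatrix}
   0 &  0 &  0 &  0 \\
   8 & -2 &  1 & -5 \\
   2 &  2 &  2 & -6 \\
  -1 & -1 & -1 &  1
\end{bmatrix}.
```

In $Z$, rows 1, 3, and 4 restricted to columns 1 to 3 take the constant values 0, 2, and $-1$, respectively, so the bicluster with constant values on rows sits at the same location as the bicluster with coherent values in $X$.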
Since we do not have any knowledge about the rows of the gene expression matrix $A$, the intuitive approach is to use an iterative multistep approach. Specifically, we iteratively construct the matrix $Z_1 + Z_2$ with constant values on columns using each row of $A$. After each iteration, we compute $Z_3 = A - (Z_1 + Z_2)$ and use Algorithm 4 to extract all perfect biclusters with constant values on rows from $Z_3$. Finally, we add to each entry of each of these biclusters the corresponding entry in $(Z_1 + Z_2)$ to obtain the biclusters with coherent values in $A$.
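The iterative procedure can be sketched in Python as follows. Several choices in this sketch are assumptions made here and not part of the paper: the row-wise search is written compactly without materializing the elementary matrices, trivial results with a single gene or a single condition are discarded, duplicates are removed with a plain set (the paper leaves the deduplication strategy open), and biclusters are reported as index sets so that their values can be read directly from A instead of adding Z_1 + Z_2 back.

```python
def constant_row_biclusters(Z):
    """Compact row-wise search: (gene indices, condition indices) pairs."""
    # One binary pattern per (row, distinct value in that row); these are the
    # nonzero rows of the elementary matrices of Z.
    patterns = [(n, tuple(1 if x == v else 0 for x in row))
                for n, row in enumerate(Z) for v in set(row)]
    result = []
    for p in sorted({q for _, q in patterns}):
        cols = [m for m, bit in enumerate(p) if bit]
        rows = sorted({n for n, q in patterns if all(q[m] for m in cols)})
        result.append((rows, cols))
    return result


def coherent_biclusters(A):
    """Additive coherent biclusters of A, found by subtracting each row in turn."""
    found = set()
    for reference in A:
        # Z3 = A - (Z1 + Z2), where every row of Z1 + Z2 equals the reference row.
        Z = [[x - y for x, y in zip(row, reference)] for row in A]
        for rows, cols in constant_row_biclusters(Z):
            if len(rows) > 1 and len(cols) > 1:  # drop trivial results (assumption)
                found.add((tuple(rows), tuple(cols)))
    return sorted(found)


A = [[1, 2, 4],
     [3, 4, 6],
     [0, 9, 3]]
found = coherent_biclusters(A)
```

Integer data is used so that the subtraction is exact; with real-valued data, the equality tests only make sense after the quantization step of Section 3.1.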
From the proof of the above property, we observe that there are many ways to construct the matrix $Z_1 + Z_2$ with constant values on columns and obtain the same bicluster with coherent values. Therefore, to avoid redundancy and gain in computational time, we need a strategy that prevents the algorithm from identifying a bicluster more than once. The strategy should take into account the fact that a row of the gene expression matrix can be part of more than one bicluster with coherent values. Such a strategy is still under investigation.

Biclusters with coherent evolution
The last type of biclusters addressed in this study is the set of biclusters that exhibit coherent evolution. Identifying such biclusters can be helpful because, in some applications, one might be interested in looking for subgroups of genes that are upregulated or downregulated across a subgroup of conditions without taking into account their actual expression values.
To extract such biclusters from DNA microarray experimental data, we use the following approach. First, we tune the parameter $e = (b_L - b_0)/L$ defined in Section 3.1. Second, we use the definition of perfect biclusters with coherent values to obtain biclusters with coherent values from the new set of data. The location of the perfect biclusters obtained from the new set of data corresponds to that of potential biclusters with coherent evolution in the original set of data. Finally, we use a merit function to validate all resulting potential biclusters as we explain below.
By tuning the parameter e defined in Section 3.1, we decrease the number L of distinct values contained in the original set of data. Thus the resulting new set of data is more uniform than the original one. By applying the algorithm that extracts biclusters with coherent values to the new set of data, we obtain perfect biclusters with coherent values. A few examples are shown and discussed below in Section 4.2. After tuning, extraction, and matching of the set of perfect biclusters obtained from the new set of data with their equivalent in the original set of data, we obtain subgroups of genes with expression levels that evolve coherently or stay constant across a subgroup of conditions regardless of their expression values. In some cases, we get biclusters with 1 or 2 imperfections. By imperfection we mean a gene with expression levels that do not evolve coherently with those of all other genes for a few conditions.
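In this study, we have used the same merit function as previous researchers [10] to validate potential biclusters with coherent evolution. Specifically, we adopt the mean-squared residue function $H$ defined by

$$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} r(a_{ij})^2. \tag{30}$$

In (30), $r(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$ is the residue function,

$$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij} \tag{31}$$

is the mean of the $i$th row in the bicluster,

$$a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij} \tag{32}$$

is the mean of the $j$th column in the bicluster, and

$$a_{IJ} = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} a_{ij} \tag{33}$$

is the mean of all the elements of the bicluster. The residue of perfect biclusters is zero, and so is their mean-squared residue. In order to validate a bicluster, we define a threshold $\delta$, and all qualified biclusters must verify

$$H(I, J) \le \delta. \tag{34}$$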
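The merit function is easy to compute directly from its definition. The following Python sketch is an illustration only; the function name `mean_squared_residue` and the test matrices are hypothetical, and the bicluster is passed in as an already extracted submatrix.

```python
def mean_squared_residue(B):
    """Mean-squared residue H of a bicluster B given as a list of rows."""
    n_rows, n_cols = len(B), len(B[0])
    row_means = [sum(row) / n_cols for row in B]
    col_means = [sum(B[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    # Residue of each entry: a_ij - (row mean) - (column mean) + (overall mean).
    total = sum((B[i][j] - row_means[i] - col_means[j] + overall_mean) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


# A perfect additive bicluster has (numerically) zero mean-squared residue.
perfect = [[1, 2, 4],
           [3, 4, 6]]
h_perfect = mean_squared_residue(perfect)

# A bicluster with no additive structure has a strictly positive score.
h_imperfect = mean_squared_residue([[0, 1],
                                    [1, 0]])
```

A candidate bicluster would then be retained when its score does not exceed the chosen threshold δ.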

Complexity analysis
We can easily estimate the complexity of the proposed approach. Recall that $N$ is the number of rows of the gene expression matrix $A$, $M$ is the number of columns in $A$, and $L$ is the number of distinct values in $A$. Algorithm 1, which is used for data quantization, requires about $N \times M \times L$ operations. Note that this step is optional. After data quantization, we perform the matrix decomposition, which requires about $N \times M \times L$ operations. Algorithm 2, which is used to extract biclusters with constant values, uses $O((N \times M + N + K + K \times M) \times L \times N_b)$ operations because we perform $N \times M$ binary multiplications, $N$ comparisons, and $K$ assignments $L \times N_b$ times. Here, $N_b$ is the number of biclusters and $K$ is the number of times (15) is verified. The complexities of Algorithms 3 and 4 can be derived similarly in terms of $K_1$ and $K_2$, the numbers of times (19) and (23) are verified, respectively.
From the above observations, the complete biclustering approach has a complexity of $O(N \times M \times L \times N_b)$. Therefore, the proposed biclustering algorithm is less complex than the FLOC algorithm proposed by Yang et al., which has complexity $O((N + M)^2 \times K \times P)$, where $P$ is the desired number of biclusters and $K$ is the number of iterations until termination. FLOC was shown by Yang et al. to be less complex than the Cheng-Church algorithm [9].

RESULTS
Let us conclude by discussing some of the results that we have obtained. As in [13], we have implemented the proposed biclustering algorithm in Matlab and tested it on the yeast gene microarray data that can be found at [15]. The data consists of 2884 genes and 17 conditions. We have obtained the following first results. Initially, the data contained L = 206 distinct values.

First set of results
In the first set of results that we report here, we set b… Because of the large number of biclusters found, we will present here a few illustrative results that will help the reader to grasp the magnitude of the problem and the nature of the results produced by the algorithm. Figure 1 shows an example of perfect biclusters with constant values, perfect biclusters with constant values on rows, and perfect biclusters with constant values on columns obtained. Figure 2 shows an example of perfect biclusters with coherent values obtained.

Second set of results
In the second set of results that we report, we explore the effect of two parameters: parameter e that defines the number of distinct values of the data set and threshold δ that qualifies the biclusters obtained.
For the threshold δ, we simply compare the residue of the biclusters obtained with the average residue of the Cheng-Church algorithm (204.293) and the average residue of the biclustering algorithm defined by Yang et al. (187.543) [9].
To explore the effect of e, we successively increased its value from 2.8883, as initially defined, to about 40. Increasing e increases the size of the biclusters obtained, but it also increases the probability that the biclusters are affected by imperfections. Figure 3 shows an example of a bicluster with coherent evolution obtained without any imperfection; in this case, there is no need to use the merit function for validation. Figure 4 shows an example of a perfect bicluster with coherent values obtained in the new data set after e is tuned up. Figure 5 shows the equivalent bicluster in the original data set. There we observe a few imperfections, and thus need to use the merit function for validation.
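The effect of e on the number of distinct values can be illustrated with a small sketch. It assumes that quantization maps each value to the nearest level b_l = b_0 + le; this rounding rule is an assumption made for illustration, not a restatement of Algorithm 1, and the sample values are made up.

```python
def quantize(value, step, offset=0.0):
    """Map a value to the nearest level offset + l * step (assumed rounding rule)."""
    level = round((value - offset) / step)
    return offset + level * step


def distinct_levels(values, step, offset=0.0):
    """Number of distinct quantization levels occupied by the values."""
    return len({round((v - offset) / step) for v in values})


# Made-up expression values, used only to show the trend.
sample = [0, 3, 5, 9, 14, 20, 27, 41]
for e in (2.8883, 25):
    print(e, distinct_levels(sample, e))
```

With the larger step, several neighbouring values collapse onto the same level, which is why larger biclusters appear but may hide imperfections in the original data.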
For comparison, we select δ = 186.543, a value that corresponds to the average value chosen by Yang et al. [9], and we set e = 25. In [9], Yang et al. identified 100 biclusters with an average of 195 genes and 12.8 conditions. In contrast, our procedure identified 258 biclusters with an average of 204 genes and 13 conditions or more. Cheng and Church, for their part, identified 100 biclusters with an average of 167 genes and 12 conditions and an average value of δ = 204.294. Our algorithm thus identifies more biclusters for the same threshold value δ. We discuss the biological significance of the biclusters that the procedure identified in the next subsection.
Note that the data conditioning and decomposition steps of our procedure took approximately 250 seconds to process the yeast data found at [15]. It then took less than 10 seconds to identify a bicluster. Its running time is thus better than that of [2], which reportedly takes 300-400 seconds to find a single bicluster, and is comparable to that of [16].

Biological significance
Since our ultimate goal is to uncover genetic pathways from the set of biclusters that our methods produce, we need to investigate the biological significance of these biclusters. Ideally, the investigation would also yield a criterion for ranking biclusters according to their biological significance. As mentioned earlier, we have not succeeded so far in identifying such a criterion. We will therefore limit ourselves in this subsection to a discussion of the biological significance of the 258 biclusters mentioned in Section 4.2. The analysis of these biclusters is representative of what we have seen so far. It also illustrates the complexity of the additional investigations that must be performed on the biclusters once they have been identified.
A preliminary assessment of the biological significance of the biclusters is currently under way, using the functional categories from the Comprehensive Yeast Genome Database (CYGD) [17,18]. The CYGD database categorizes yeast genes into fine groupings using an annotation system called FunCat, the functional classification catalog. More information can be found in [19].
Table 1 provides a preliminary biological significance analysis of the 258 biclusters in Section 4.2. The second row of Table 1 lists how many biclusters were found. Rows three through five show how many biclusters belong to one of four mutually exclusive categories. The third row shows how many of those biclusters contained genes that were all annotated under the same function. An example of a bicluster in this grouping would be three genes that all produce proteins whose main purpose is metabolism. The fourth row displays how many of the biclusters picked up only genes that were unclassified. The fifth row lists the number of biclusters that contained genes annotated to the same function as well as unclassified genes. Interestingly, the algorithm picks up biclusters that are composed entirely of functionally unclassified genes. Another unexpected result was the number of biclusters that contained "mixed" data. The appearance of such biclusters led us to pose several questions that we are attempting to answer in collaboration with researchers in the biological sciences. The genes in these mixed biclusters showed patterns of coherent evolution but did not necessarily fall in the same functional category.
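The grouping used in Table 1 can be sketched as a small tally over gene annotations. This is an illustration under stated assumptions: the gene names and the `annotation` mapping are hypothetical placeholders rather than CYGD data, an unclassified gene is represented by a missing entry, and the fourth category ("several functions") is one possible reading of the "mixed" biclusters.

```python
from collections import Counter


def categorize(genes, annotation):
    """Assign a bicluster to one of four mutually exclusive categories."""
    functions = {annotation.get(gene) for gene in genes}  # None means unclassified
    known = functions - {None}
    has_unclassified = None in functions
    if not known:
        return "unclassified only"
    if len(known) == 1:
        if has_unclassified:
            return "same function plus unclassified"
        return "same function"
    return "several functions"


# Hypothetical annotation table; geneD and geneE are left unclassified.
annotation = {"geneA": "metabolism", "geneB": "metabolism", "geneC": "transport"}
biclusters = [
    ["geneA", "geneB"],
    ["geneD", "geneE"],
    ["geneA", "geneD"],
    ["geneA", "geneC"],
]
tally = Counter(categorize(genes, annotation) for genes in biclusters)
print(tally)
```

Each bicluster lands in exactly one category, so the tallies sum to the number of biclusters examined.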
The presence of these biclusters may be indicative of the fact that coregulated genes do not necessarily belong to the same functional category. On the other hand, it may indicate that these genes have other unknown functions or functions that were not captured in the annotation we used. It is also possible that the expression levels of certain genes that belong to a given functional category affect those of some other genes that belong to a different functional category.
Many of the mixed biclusters are of biological interest because they contain genes that either belong to a single functional category or are unclassified. Current investigations are attempting to determine whether the unclassified genes in these biclusters do actually belong to the same functional category as the others. With colleagues, we are examining the literature to identify the theorized functions of many of the unclassified genes that appear in mixed biclusters or biclusters with unclassified genes. We are also studying alternative gene annotation sources, such as GO-slim [20], to answer some of the questions that we posed here.

CONCLUSION
In this study, we developed an efficient biclustering algorithm that can be used to extract from a set of data biclusters with constant values, constant values on rows, constant values on columns, and coherent values. We also described an approach for finding biclusters with coherent evolutions. This approach combines the algorithm that finds biclusters with coherent values with an adaptive gene expression level quantization procedure. Since completing this work, we have also developed an alternative fast and direct approach for finding all biclusters with coherent evolutions with no imperfection [21]. In contrast to prior work, our procedure is able to find all biclusters with constant values, constant values on rows, constant values on columns, and coherent values. Furthermore, its complexity is similar to or lower than that of prior work.
Input: A = quantized microarray data
Output: B_j = biclusters with constant values on columns
Begin
    Compute: P_c, c_j, c_l^m
    For j = 1 to P_c
        B_j = [];
        For l = 1 to L
            For m = 1 to M
                If c_j .* c_l^m == c_j
                    B_j = [B_j [Conditions(m); α_l c_j]];
                End
            End
        End;
        B_j = [[0; Genes] B_j];
    End
End Begin
Algorithm 3: Algorithm for finding biclusters with constant values on columns.
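The core test of Algorithm 3 can be sketched as follows. This is one possible reading of the pseudocode, not the Matlab implementation: c_l^m is taken to be the binary vector marking the genes whose value in condition m equals the l-th distinct value α_l, and the candidate gene indicators c_j are supplied by the caller because their computation is described elsewhere in the paper. All function names are hypothetical.

```python
def column_indicator(matrix, col, level):
    """Binary vector marking the rows whose entry in column col equals level."""
    return [1 if row[col] == level else 0 for row in matrix]


def constant_column_biclusters(matrix, candidates):
    """For each non-empty candidate gene indicator c_j, collect the conditions
    on which all selected genes share one value (the test c_j .* c_l^m == c_j)."""
    levels = sorted({v for row in matrix for v in row})
    n_cols = len(matrix[0])
    results = []
    for c_j in candidates:
        conditions = {}
        for level in levels:
            for m in range(n_cols):
                c_lm = column_indicator(matrix, m, level)
                masked = [a * b for a, b in zip(c_j, c_lm)]
                if masked == c_j:
                    conditions[m] = level  # column m is constant at this level
        genes = [n for n, flag in enumerate(c_j) if flag]
        results.append((genes, conditions))
    return results


# Toy quantized matrix: 4 genes (rows) by 3 conditions (columns).
A = [[1, 5, 2], [1, 5, 7], [3, 5, 2], [1, 4, 2]]
found = constant_column_biclusters(A, [[1, 1, 0, 0], [1, 0, 1, 1]])
print(found)
```

Each result pairs a gene set with a mapping from condition index to the constant value taken on that condition; an empty mapping means the candidate gene set yields no bicluster of this type.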

Algorithm 4: Algorithm for finding biclusters with constant values on rows.
After data conditioning with b_l = b_0 + le = 2.8883l, 1 ≤ l ≤ L, we obtained L = 111 new distinct values. From our simulation, we then obtained N_b = 10225 biclusters with constant values, N_b = 3391 biclusters with constant values on rows, and N_b = 836 biclusters with constant values on columns.

Figure 1: Example of bicluster (a) with constant values; (b) with constant values on rows; and (c) with constant values on columns.

Figure 3: Example of bicluster with coherent evolutions obtained from the new data set after e is tuned up.

Figure 4: Example of perfect biclusters with coherent values obtained from the new data set after e is tuned up.
Figure 5: Equivalent bicluster in the original data set.

Table 1: Biological analysis of the 258 biclusters with coherent evolutions.