Novel data fusion method and exploration of multiple information sources for transcription factor target gene prediction

,


Introduction
A central problem in molecular system biology is to understand the manner in which a cell operates its complex transcriptional machinery.At molecular level, transcriptional processes are largely controlled by transcription factors (TFs) that bind to gene promoters in a sequence-specific manner and, thereby, inhibit or promote the expression of their target genes.Collectively, these DNA-binding proteins and other molecules work together to implement the complex regulatory machinery that controls gene expression.Since large-scale understanding of transcriptional regulation is still severely limited even in lower organisms, it is of great importance to reveal these regulatory protein-DNA interactions genomewide.
Experimentally verified TF-binding sites (TFBSs) have been collected in databases [3][4][5] and recently developed experimental methods, such as ChIP-chip or ChIP-seq, are capable of measuring in vivo TFBSs in high-throughput manner.However, it is not possible to obtain sufficient coverage, that is, to screen all TFs under all conditions, using experimental methods alone.Therefore, the binding site prediction problem calls for computational methods.Computational predictions rely on sequence specificities that are typically taken from a database [4] or obtained as an output from a motif discovery method [6].Recent progress on experimental side has made it also possible to measure TF-binding specificities in high-throughput manner [7].The advent of these experimental techniques equips TF target gene prediction methods with much more accurate binding specificity models and, indeed, opens a whole new avenue for computational analysis of TF-DNA binding.
Sequence specificities alone, however, are not sufficiently informative to accurately predict TFBSs simply because the probability of observing an exact copy of a presumably functional binding motif in a genome by chance is remarkably high.A natural way to improve TF target gene predictions is to incorporate additional information into statistical inference of TFBSs.A number of additional data sources can be useful for this purpose, including, among others, information on coregulated genes, evolutionary conservation, physical binding locations as measured by ChIPchip or ChIP-seq, nucleosome occupancies, CpG islands, regulatory potential, DNase hypersensitive sites, and so on.Incorporating additional information sources to guide statistical inference has successfully been made use of in the context of motif discovery [8][9][10][11], but has not attracted enough attention in TF target gene prediction.We have recently developed a probabilistic TF target gene prediction method, ProbTF, which can incorporate practically any additional genome-level information source to predict TF target gene [12].
Statistical data fusion for TF target gene prediction becomes more challenging in the case of multiple information sources.Here we develop a new method for multiple data fusion and incorporate novel data sources into TF target gene prediction.Four genome-level additional information sources (i.e., information at the level of individual nucleotides), evolutionary conservation, nucleosome positioning data from a recently published computational method, regulatory potential, and DNA duplex stability, are employed here to improve TF target gene prediction, which is expected to be informative of binding sites as will be discussed shortly.Some of these and other individual data sources have already been shown to improve de novo motif discovery [8][9][10][11].Here we demonstrate how multiple data sources can be combined to make joint statistical inference of TF target gene.Integration of data sources that have a probabilistic interpretation is relatively straightforward [12], and for other data sources we convert the raw data into probabilities, or prior distributions, by extending a previously proposed Bayesian transformation method [11].In addition, for efficient use of DNA duplex stability data, we propose a simple heuristic that can assess the binding preference (single versus double-stranded DNA) for a TF from a set of known binding sites.Results on a carefully constructed set of verified binding sites in mouse genome [3,5,12] demonstrate that the new data fusion method that we propose here improves the performance of TF target gene prediction methods.We also demonstrate that a number of genome-level data sources, either alone or especially in combination, are highly informative of TF target gene.Consequently, our statistical data fusion method can gain valuable new insights into genomewide models of transcriptional regulatory networks.

Methods
Given the fundamental role of TFs in transcriptional regulation, we focus on predicting TF target gene.Because each individual data source is noisy and gives only a partial view of the underlying regulatory mechanisms, we focus on making statistical inference for TFBSs from multiple information sources.The essence of the data fusion problem that we encounter is illustrated in Figure 1, which shows four examples of verified binding sites from the test data set together with the associated additional genome-level data sources [12].The first row in each subplot shows the annotated binding site(s) for a TF in a gene promoter.The next rows (named by their TRANSFAC IDs, grey) show the log-likelihood scores of the position specific frequency matrix (PSFM) models to the Markovian background model φ.The following five rows show the additional data sources: probability of conservation (con.[13], green), regulatory potential (reg.[14], blue), nucleosome positioning signals predicted by two different methods (npy.[1] and nuc.[2], magenta), and DNA duplex stability (DNA.[15,16], red) score for each position of the sequences.The joint prior combined from all the explored additional data sources is shown in black in the last row.The median and mean of the scores for each data type applied to the sequences shown in Figure 1 are recorded in Table S1 in supplementary material available online at doi: 10.1155/2010/235795.
Figure 1 shows that the highest log-likelihood score is not always obtained at the annotated binding site.TFs are commonly associated with multiple PSFMs since one TF may allow certain variation in its binding motif.Thus, it can be difficult to combine predictions from multiple PSFMs given that these PSFMs may be extremely similar or different.This issue can be solved by, for example, ProbTF method, which implements an intuitive way of combining predictions by multiple PSFMs: ProbTF considers all possible numbers of nonoverlapping TFBSs in all possible locations and configurations and weights each configuration according to its probability.A more difficult problem is to decide that which of the peaks predicted by PSFMs correspond to real, functional binding sites.As illustrated in Figure 1, the PSFM-based profiles have relatively good sensitivity but poor specificity, which is common to many PSFMs.The lack of specificity can be greatly improved by genome-level data fusion, which forms the focus of this study.
Corresponding to what is known about transcriptional regulation, many of the verified binding sites typically have high degree of conservation [8] and high regulatory potential scores [14] and are typically free of stable nucleosomes (i.e., have low nucleosome occupancy scores) [17].Moreover, DNA double helix destabilization energies at TF binding sites are different from those at random sites [11].In particular, TFBSs tend to have high DNA duplex stability score if a TF prefers to bind both strands of the promoter sequence (Figures 1(a) and 1(b)) and low DNA stability score in the opposite case (Figures 1(c) and 1(d)) The above reasoning seems to provide a simple logic for filtering the real TFBSs.
However, correlation between TFBSs and any of the additional data sources cannot be expected to be perfect even from a biological point of view.For example, only about 50% of functional binding sites are assessed to be evolutionary conserved [18].The additional information sources are also noisy, regardless of whether they are experimental measurements or computational predictions.The only possibility is to make statistical inference, which takes the inherent Evolutionary conservation (green), regulatory potential (blue), two nucleosome positioning signals [1,2] (magenta), and DNA duplex stability data (red) are shown in the following five rows (abbreviated with con., reg., npy., nuc.and DNA., resp.).The joint prior from all the four additional data sources (black) is shown in the last row.TFs shown in panels (a) and (b) are assumed to bind to their corresponding sequences in a double-strand manner, while TFs in panels (c) and (d) bind in a single-strand manner.All plotted data are for mouse genome.
randomness into account, from multiple genome-level data sources.The rationale is that the accuracy of computational TF target gene predictions naturally improves when more (useful) information is incorporated into statistical analysis.

Probabilistic Framework for TF Target Gene Prediction.
We first describe the TF target gene prediction algorithm employed in this study (full details can be found from [12]).Let S = (s 1 , . . ., s N ) denote a single strand of a promoter sequence, where s i ∈ {A, C, G, T} and N is the length of the sequence (generalization to double-stranded DNA sequences is also possible but omitted here).Let Q denote the number of (unknown) binding sites and A the (hidden) start positions of nonoverlapping binding sites in sequence S; that is, if Nonbinding site (i.e., background) sequence locations are modeled by the dth order Markovian background model φ.Assuming that we have access to the d previous nucleotides before the start of the actual sequence S, the likelihood of a sequence S having no binding sites for any TF is We set d = 0 since that value provides the best results in [12].TFBSs are modeled with the standard PSFM model which is a product of independent multinomial distributions.Let θ(s i , j) = P θ (s i , j) denote the probability of observing nucleotide s i at the jth ( j = 1, . . ., l) position of θ, where l is the length of the motif.Assume a TF is characterized by M PSFMs, Θ = (θ (1) , . . ., θ (M) ).Define π ∈ {1, . . ., M} c as the configuration of motif models from Θ in A; that is, π i specifies the motif model θ (πi) , which begins from location a i and has a length l πi .Further, the probability of sequence S, given nonoverlapping motif positions and the motif and background models, is where The probability that a sequence S has c binding sites is obtained with Bayes' rule where the normalization factor is and N/l min is the maximum number of nonoverlapping motifs in an N-length sequence.As proposed in [12], the prior of the number of motif instances, P(Q = c | Θ, φ), is assumed to be independent of Θ and φ and has an exponential form where C = 2 N/lmin −1 i=0 κ i .We use κ = 0.5.This formula defines the (user definable) prior expectation of the number of binding sites in a given DNA sequence.Importantly, it does not incorporate any of the informative data sources studied here.This prior, primarily only, increases or decreases of the estimated binding probabilities, and as such has little effect on, for example, the ROC curves.The probability P(S | Q = c, Θ, φ) is obtained with the assumption that, for a fixed value of Q, the prior over binding site positions A and configurations π is uniform and inversely proportional to the number of different binding site positions and configurations.The probability P(S | Q = c, Θ, φ) is obtained by summing over all possible positions and configurations, and can be computed efficiently using a recursive formula [12].
Finally, the probability that a TF which is characterized by Θ binds to a promoter S, P(Θ → S | S, Θ, φ), is defined as the probability that at least one of the motif models in Θ has a binding site in ( Integration of additional data sources into the aforementioned probabilistic TF target gene prediction framework is carried out by assuming that the data sources are in the form of D = (P 1 , . . ., P N ) where P i is the probability that the ith base pair location is a binding site.D can be derived from a single data source or from multiple data sources (see subsections "DNA duplex stability data", "Nucleosome occupation data", and "Data integration method" of this section for details).Assuming that S and D are conditionally independent and the probability of D does not depend on the PSFM and background models, the probability of S and D given A, π, Θ, and φ is Following ( 1), the probability P(D | A, π) is modeled as and, thus, the joint probability P(S, D | A, π, Θ, φ) can be written compactly as where k=0 ((P aj +k )/ (1 − P aj +k )).Consequently, the same efficient recursive algorithm can be used to compute P(Θ → S | S, D, Θ, φ) (see [12] for more details).
Note that the choice of Markovian and PSFM models is arbitrary.Also note that since additional data are incorporated using probabilities of binding over the promoter sequence; we could also employ methods other than ProbTF.

Data Integration Method.
Define the mth additional genome-level data source (for a single gene promoter having length N) as D (m) = (P (m) 1 , . . ., P (m) N ), 1 ≤ m ≤ n.Denote the probabilities for position i from n different data sources as P i = (P (1)  i , . . ., Further, define a thresholded version of probabilities P (m) i as where T (m) is a threshold for the mth data source and is defined as a percentile q of the distribution of the mth data source.Then the thresholded scores for position i can be written as be the number of data sources that exceed their thresholds at location i, then the integrated probability for position i, P i , is calculated as The data integration method is parameterized by L 0 , L 1 , . . ., L n and q.Note that v i ∈ {0, 1, . . ., n} and L vi+1 ≥ L vi .It is also worth noting that the resulting probabilities do not include hard thresholding for any of the genomic locations although thresholding is involved in integration, and the use of thresholding during the construction is motivated by its simple yet powerful parametrization.
The data integration method is illustrated in Figure 2 for the case of two additional data sources with parameters L 0 = 0.5, L 1 = 0.7, L 2 = 1, and q = 0.9.For illustration purposes, both data sources are assumed to have uniform distribution and hence T (1) = T (2) = q.An illustration of the prior integration method.An illustration of the prior integration method for the case of two additional data sources.x and y axes correspond to the two data sources and z-axis corresponds to the integrated prior.
In the above genome-level data integration method there are n + 1 (n is the number of additional data sources) weighting parameters L 0 , L 1 , . . ., L n , and one threshold q for emphasizing the most informative binding locations.There are also two scaling parameters, a multiplicative factor a (m) , and a bias term b (m) , for each additional data source, and one scaling parameter, c, for combining other data sources with the TF target gene prediction analysis.These parameters are used to scale the original probability values into a proper range.In particular, the scaling parameters are used in the following way (for the mth data source): (11) where P(X ∈ B | R (m) (X)) is the probability that a DNA site X is a TFBS (X ∈ B) given the value of the mth raw data R (m) (X).For conservation and regulatory potential the original data are already in a probabilistic format, and for nucleosome and DNA stability data the conversion of the raw data into probabilities was described in the previous sections.Probability P i is the final integrated prior probability for position i after scaling, which is directly used in further TFBS prediction as explained, for example, in ( 6) and (7).
All the parameters needed in this study were chosen by a grid search method via optimizing receiver operating characteristic (ROC) curves, and the importance of each data source could be reflected by the multiplicative factor "a"; that is, the higher the multiplicative factor the less noisy or more important this type of data is."1-specificity" (x axis) and "sensitivity" (y axis) are used to draw the ROC curves according to specificity = TN (FP + TN) , where TN, FP, TP, FN each stands for "true negative", "false positive", "true positive", and "false negative", respectively.In particular, TN, FP, TP, FN are obtained by comparing the computed binding probabilities (of a sequence to have a binding site for a TF) with known binding site information from the test data set, that is, "0" (no binding site) and "1" (at least one binding site).We used area under the curve (AUC) and AUC30 (the AUC for the area between false positive rates [0, 0.3]) to optimize the parameters.In case of four additional data sources, we are dealing with a highdimensional grid search problem.Since the grid size grows exponentially with the dimension, we resort to a heuristic where each parameter is optimized separately using a 1dimensional grid search while keeping the other parameters fixed.Moreover, parameter optimization is done sequentially so that we first optimize parameters a (m) and b (m) for individual data sources.Scaling parameters L 0 , L 1 , . . ., L n are optimized similarly except that L n is always assigned to 1.For example, parameters L 1 and L 0 are optimized using two data sources, which are then kept fixed and assigned to L 2 and L 1 , respectively, when optimizing new parameter L 0 using three data sources, so forth.In our study, we optimized the parameters of up to four data sources, which are L 0 = 0.72, L 1 = 0.72, L 2 = 0.73, L 3 = 0.8, and L 4 = 1, respectively, and q equals 0.93.It is worth noticing that the adjacent L w 's (0 ≤ w ≤ n) tend to be similar for small values of w, and especially we have L 1 = L 0 when n equals 4.This accords well with the main feature of our new data fusion method, which is to search for bona fide locations (indicated by several data sources) and reduce false positives by not paying too much attention to the locations indicated by fewer data sources.All the rest optimized scaling parameters are listed in Table 1.
The scaling parameters, that is, "a", "b" and "c", are relatively robust, whose slight variations would not dramatically affect the results.We varied "a" of the DNA duplex stability data (for both double and single strand binding data), which is supposed to have more effect on the results (recall that "a" weights different information sources and reflects their importance), and listed its AUC scores for single data source as well as its combination with other additional information sources in supplementary Table S2.It is clear that with small changes of "a", the results do not vary significantly.However, for the weighting parameters, that is, "L 0 " to "L n ", and the threshold, q, their small changes may have greater effect on the results, since they determine how different data sources are combined.This can be seen from the closer values among "L 0 " to "L 3 ", which are 0.72, 0.72, and 0.73, respectively.These parameters depend heavily on the quality and type of data, and should be optimized before data integration.

DNA Duplex Stability Data.
The DNA stability measures the amount of energy needed to separate the two strands of DNA.In this study the DNA destabilization energies were obtained from an online tool WebSIDD [15,16], where the parameters were set to "DNA Type: circular", "Energetic Type: near neighbor", and "Energy Cutoff: level 4".Note that circular DNA is assumed to calculate the duplex stabilities of linear DNA.This is because WebSIDD handles linear DNA similarly with circular DNA but adding 50 G/C to the end, which is not needed here given the extended DNA used.We obtained the energy score for each sequence with 1 kb extension from both ends.For every binding site X we computed the energy of destabilization G(X) as the average of the destabilization values G(X, i) for all positions i within this site.

Assessing Binding Preference for Each TF.
Relatively little is reported about specific types of protein-DNA interactions in the literature and the protein domain annotations are not available for all TFs, thus, we decided to assess the binding preference for each factor simply by looking at the DNA stability scores at the known binding sites in the test data set.With the assumption that the binding preference of a TF is the same to all its binding sites, we estimated the binding preference of each TF with the following heuristic.Let A denote the set of all known start binding positions of a TF among all the tested sequences in our test set.For all the known binding sites in A, we compute counts dC and sC which are the number of times respectively, where j is the width of the verified binding site j in the test set and T is the threshold specified by quantile q.Then, the TF is assigned to bind in a double-strand manner if dC > sC, in a single-strand manner if dC < sC, and in cases dC = sC, random preference is assigned.In order to make the above heuristic more robust, we repeated it for three thresholds specified by different quantiles q = {0.6,0.7, 0.8} with both raw G(X, i) and smoothed G(X, i) = i+9 j=i G(X, j)/10 DNA duplex stability scores.The final binding preference of each TF is made by taking a vote among these six binding preferences, and again in case of a tie random binding preference is assigned.

Construction of DNA Duplex Stability
Prior.We built three data sets to construct the DNA duplex stability priors: one positive single-strand binding data set, one positive double-strand binding data set, and one background data set.The positive data sets are constructed from 226 known binding sites in our test data set by splitting the known binding sites into single-and double-strand binding sets according to the binding preference of each TF.The background data set is generated as follows.For each verified binding site in our test set, we randomly select 20 genomic locations (from the same promoter sequence) with the average binding site of length 12, which results in a background set that is 20 times larger than the test set.
The raw DNA duplex stability scores are converted into probabilities using a similar method as in [11] with an extension to account for both single-and double-strand binding preferences.For each data set, we built a histogram of the energies, then normalized and smoothed the values to get a probability distribution.The cumulative distribution functions (CDFs) of the three data sets are shown in Figure 3(a), which indicate that DNA duplex stability data does provide us discriminative information about TFBSs.All known binding sites, on which the performance is eventually evaluated, are used to draw Figure 3(a), which leads to circular reasoning.However, our cross-validation and randomization simulations show that this biasing effect is negligible.For every energy value e and binding site X, the conditional density of the single-and double-strand binding data are P(G(X) = e | X ∈ sB) and P(G(X) = e | X ∈ dB), respectively, where sB and dB denote single-and double-strand TFBSs, respectively.Similarly, for the random genomic locations we have P(G(X) = e).We also estimated the frequency of the randomly chosen DNA sites that have a significant overlap with any of the known single-strand and double-strand binding sites, P(X ∈ sB) and P(X ∈ dB), respectively.Bayes' rule is used to compute the probability that a DNA site X is a single-strand TFBS given its energy (similar calculation is also applied to the double-strand case) 2.4.Nucleosome Occupation Data 2.4.1.Construction of Nucleosome Occupation Prior.We built the nucleosome occupation prior in a similar way as what we did with the DNA stability data, but with only two data sets: positive and background (see also [11]).The positive data set consists of the averaged N-scores (the raw nucleosome occupancy scores obtained using the method in [2]) of the known binding positions.The background data set is composed of the averaged N-scores of randomly selected genomic locations in the same way as above.For every occupation score o, the conditional probabilities for binding and nonbinding sites are denoted as P(N(X) = o | X ∈ B) and P(N(X) = o), respectively.The CDFs of the two nucleosome data sets are shown in Figure 3(b), which indicate that the nucleosome positioning information from [2] is informative of TFBSs.The probability that a DNA site X is a TFBS given its nucleosome occupation score is obtained by (13) (with sB replaced by B).Note that P(X ∈ B) = P(X ∈ sB) + P(X ∈ dB).
2.5.Data.We validate our computational methods using the same mouse data set as in [12], which consists of 47 promoter sequences (as shown in Table 2), each with a varying number of annotated binding sites from ABS [3] and ORegAnno [5] databases (the annotated binding sites are also listed in Table 2).Sequence lengths are 2 Kbps or vary around 500 bps.PSFM models are taken from TRANSFAC [4] (professional version 10.2).The additional data sources used here are conservation, regulatory potential, DNA duplex stability, and nucleosome positioning.The first two data sources are the same as what have been used in [12], where conservation is assessed with the PastCons scores [13] and regulatory potential is constructed from a set of known regulatory and nonregulatory sequences using a discriminatory computational analysis (prediction algorithm is named "ESPERR") [14].DNA duplex stability, and nucleosome positioning are the two new data sources explored in more detail in this study.We use our computational methods to

Results and Discussion
In this section, the results of exploring two novel additional data sources, evaluating the new data fusion method and comparison among different data source combinations in TF target gene prediction are sequentially reported and discussed.The idea of our computational methods is to probabilistically bias the search of binding sites to those genomic locations that are more likely to contain binding site(s) in light of the additional data.The qualities of the TF target gene prediction results are evaluated by the ROC curves and the histograms of the estimated binding probabilities, which are drawn from the probabilities over all the TFs and the sequences being analyzed.The test data set used throughout this paper consists of 47 promoter sequences, each contains a varying number of annotated binding sites from ABS [3] and ORegAnno [5] databases.

DNA Duplex Stability
Prior.Most sequence-specific DNA binding proteins contact with the major groove of double stranded DNA in the B conformation [19], and some TFs are shown to bind DNA in a double-strand manner according to their crystal structures [20].Thus, the DNA destabilization energies at protein binding sites of these TFs are expected to be high.This assumption has been verified in yeast by [11] on improving the accuracy of TFBS discovery, which is a different topic other than TF target gene prediction.On the other hand, during transcription, the two DNA strands must be separated to let RNA polymerase slide along the DNA molecule and synthesize a nascent mRNA.
Since the binding sites for many general TFs are located in the proximal promoter regions of the transcribed gene, it is expected that the DNA double helix of these regions is low, that is, low DNA duplex stability.Besides, there also exists experimental evidence showing that some regulatory proteins bind to DNA in a single-strand manner [21,22].Taken together, these suggest that DNA duplex stability data should be informative of binding sites; whether a lower or higher DNA duplex stability at specific TF binding sites is more preferable depends largely on the binding preference of the TF, that is, whether the TF binds to the the DNA in a double-or single-strand manner.In our study, we assume that TFBSs for TFs with single-strand binding preference occur preferentially in regions with low DNA duplex stability, and the other way around for double-strand binding TFs.
In the TF target gene prediction analysis, the raw DNA duplex destabilization energies were converted into probability values using a Bayesian transformation method, and each TF's binding preference is predicted with a heuristic method (see Section 2 for details).
From the ROC curves shown in Figure 4(a) and supplementary Figure S2(a) we can see that DNA duplex stability alone can slightly improve the TF target gene prediction accuracy, and its performance can be remarkably improved by combining with other priors, such as conservation (Figure 4(c) and supplementary Figure S2(g)) or regulatory potential (Figure 4(b) and supplementary Figure S2(d)).Table 1 also demonstrates that the AUC scores for combining DNA energy with conservation or regulatory potential are higher than those obtained with single additional information sources.These results indicate that DNA duplex stability data has the potential of improving TF target gene prediction depending on how and which data sources it is combined Table 2: Sequences used in this study.One "TFBS duplex stability score" is computed as the average of all the raw DNA duplex stability scores over a given TFBS.The TFBS duplex stability scores are computed for all the binding sites of a promoter sequence.Note that one sequence can have multiple binding TFs and TFBSs, one TF can bind to more than one site, and one TFBS may be recognized by multiple TFs.with.Further, DNA duplex stabilities are expected to be more informative in TF target gene prediction if they are obtained experimentally.
Out of the 23 TFs whose PSFMs are known and studied here, nine are predicted to bind sequences in a single-strand manner and 14 bind sequences in a double-strand manner.Information such as the names and binding promoters (in mouse genome) of these 23 TFs are listed in Tables 2 and 3, with more detailed information available from http://www.probtf.org/.Also shown in Table 2 are the DNA duplex stability scores for all the binding sites in all the promoter sequences used in this paper, each of which is averaged over all the raw stability scores of a TFBS.These TFs include all the (mouse) TFs whose binding site information can be downloaded from ABS [3] or ORegAnno [5] databases and whose binding specificity model(s) can be found from the TRANSFAC database [4] (professional version 10.2).It is seen from Table 3 that, for the six TFs whose binding preferences are known, our predicted binding preferences accord well with the literature-derived information.In order to avoid the possible bias that could be introduced when the binding preference of each TF is predicted from the same data that is used for validation, we also performed the standard leave-one-out cross validation on the binding preference prediction.These results clearly demonstrate that no significant differences are observed.Thus, our binding prediction method when integrated with DNA duplex stability data should have a similar good predictive performance outside our test data set as well.
3.1.2.Nucleosome Occupation Prior.Chromatin structure has an important role in regulating the transcriptional machinery.At the genome level, these mechanisms are controlled by the basic structural subunits, nucleosomes, which can limit the access of TFs to their binding sites [1,17].Thus, from the viewpoint of computational TFBS prediction, the likelihood of a TF binding to nonfunctional sites can be decreased by locating a stable nucleosome over those genomic regions while keeping functional sites accessible for TF binding.The validity of this assumption can be verified, for example, by the fact that the binding of SP1, GAL4, and USF to nucleosome cores requires other proteins such as nucleoplasmin to remove H2A and H2B which consequently results in nucleosome disassembly [28], and proven by the evidence that the binding propensity of glucocorticoid receptor (GR) to the nucleosome core is much lower than that to the nucleosome free sequence [29].
However, the probability of some TFs binding equally well or even better to sequences occupied by nucleosomes compared with nucleosome free regions could not be excluded, where nucleosome location data alone will not be sufficient and multiple data sources may be used to improve the prediction accuracy.
High-resolution genomewide nucleosome positioning data exist for organisms such as yeast [30] and human [31], but in the case of mouse, we currently need to rely on computational predictions.Indeed, this computational prediction problem has attracted lots of interest and improved methods have been proposed recently.ProbTF method was previously tested with predicted nucleosome locations from Segal's original model which rely on dinucleotide frequencies [1] and the nucleosome data was not found to be informative of binding sites.Here we explore the problem that whether more recent and more accurate nucleosome positioning data together with a novel data fusion method can improve TF target gene prediction.In this study, we used a computational multiresolution method developed in [2] to predict the nucleosome locations for all the 47 tested sequences.We decided to use the raw nucleosome positioning data, that is, without the hidden Markov model (HMM) processing, and employ the extended sequences to obtain the N-score for each genomic location.The raw data were further converted into probabilities using a Bayesian transformation method (for details see Section 2).
We compared the two different nucleosome data by integrating them separately into our TF target gene prediction algorithm.It is particularly promising to see that the use of more accurate nucleosome positioning data from [2] results in more accurate TF target gene prediction as shown in supplementary Figure S3(a).Similarly as in the case of DNA duplex stability data, we combined nucleosome data with conservation (supplementary Figure S3(e)) or regulatory potential data (supplementary Figure S3(c)), and the combined data again improve the TF target gene predictions.For example, the AUC score of 0.7555 which is obtained with nucleosome data alone increases to 0.7946 when combined with regulatory potential, and jumps to 0.8334 when combined with conservation.
In order to gain insight into each individual data source and to assess the extent of possible overfitting problem stemming from parameter optimization, we also prepared an additional control simulation.We shifted each additional data source by 100 base pair positions and then applied our computational methods as explained above, including binding preference prediction and optimization of parameters, to test performance of randomized data.ROC curves corresponding to the four shifted information sources are shown in supplementary Figure S4 and the AUC scores after shifting for each data source are recorded in Table 1.For the two novel data sources, we also compared their CDFs after shifting with the original ones as shown in Figure 3.The Kolmogorove-Smimov statistic (KS statistic) for CDFs of DNA duplex stability scores at random sites and double strand binding sites is 0.3097, and that of random sites and single strand binding sites is 0.3641 (as depicted in Figure 3(a)).However, after shifting, the KS statistics between random locations and double strand binding sites and between random sites and single strand binding sites become 0.1905 and 0.1168, respectively, (see Figure 3(c)).Similarly, the KS statistic between the CDFs of nucleosome positioning data at random sites and nucleosome binding sites is 0.1699 (Figure 3(b)) and drops to 0.0379 after shifting (Figure 3(d)).We also measured the Kullback-Leibler divergence (KL divergence) between each density pair.The KL divergence between PDFs of DNA duplex stability scores at random sites and double and single strand binding sites are 0.1868 and 0.6617 (Figure 3(a)), which decreases to 0.1037 and 0.1065, respectively, after shifting (Figure 3(c)).Likewise, the KL divergence between PDFs of nucleosome positioning data at random sites and nucleosome occupied sites drops from 0.1830 to 0.0330 after shifting, as represented by Figures 3(b) and 3(d), respectively.These results show that no information is gained from the shifted data sources.Taken together with the cross-validation results shown above, this demonstrates that the improved binding prediction accuracy is not an artifact of overfitting.
We further compared the scaling parameter a (see (11) in Section 2) when integrating different nucleosome data and DNA duplex stability data into the TF target gene prediction framework.The parameter a essentially determines the weight of each individual information source.As shown in Table 1, parameter a of nucleosome positioning data obtained from [2] (0.04) is higher than that obtained with data from [1] (0.01), which is consistent with results in supplementary Figure S3(a) where nucleosome data from [2] clearly provides more information than those obtained from [1].Similarly, parameter a of DNA duplex stability data for TFs with single-strand binding pattern (0.06) is higher than that for TFs with double-strand binding pattern (0.01).This is again consistent with results shown in Figure 3(a), where DNA stability energies of single-strand binding TFs provide much better discrimination than those of double-strand binding TFs.These results show that the scaling parameter a has an association with data quality, where a higher a indicates a more informative data.

Multiple Data Fusion Method.
We next briefly demonstrate the performance of the new data fusion method and compare it with that of a standard weighting-based scheme proposed in [12].Qualitatively, the previous data fusion method is based on a type of averaging where a genomic location is suggested to contain a binding site only if a large majority of the additional data sources indicate a binding site, whereas the new method can assign more prior probability to a genomic location if it is indicated as a binding site by a few (or even a single) more informative data sources (see Section 2 for a detailed technical description of our data fusion methods).
The performance of the old and new data fusion methods are illustrated in supplementary Figure S1, which shows the ROC curves for finding the verified binding sites in the gene promoters set using both evolutionary conservation and regulatory potential.Parameters in supplementary Figures S1(a) and S1(c) are chosen by the whole AUC and the AUC30, respectively.Supplementary Figure S1(a) shows that the new method works better than the old one by generating higher overall AUC, and supplementary Figure S1(c) demonstrates that the new method can improve the prediction accuracy especially in low false positive rate (FPR) region, which is a highly preferable property in general.
Supplementary Figures S1(b) and S1(d) show the histograms of the predicted binding probabilities for both the old and new data fusion methods, where the parameters in Figures S1(b) and S1(d) are selected according to the whole AUC and AUC30, respectively.Histograms are drawn separately for negative and positive cases and, hence, these graphs clearly demonstrate how well the two methods are able to discriminate the target genes that contain known binding sites from nontarget genes that do not contain binding sites.From these graphs, we can see that the new method improves discrimination by assigning much smaller binding probabilities for sequences with no known binding sites (no matter whether AUC or AUC30 is used), which thus results in much smaller false positive rate.AUC scores for single and all combinations of multiple data sources are summarized in an ascending order in Table 1, and their corresponding data fusion results are shown and discussed in the following sections.

Comparison of Combinations of Information Sources.
In order to better understand that which combinations of additional genome-level data sources are most informative of TFBSs, we compared the TF target gene prediction accuracy of all possible combinations among evolutionary conservation, regulatory potential, nucleosome locations, and DNA duplex stability.The best combination is conservation and nucleosome positioning, whose results have already been shown in supplementary Figure S3(e).
Results for all the six duplets of data sources are reported in supplementary Figure S5, which shows that most of the combinations of two data sources work better than their corresponding single data sources except for the combination of nucleosome occupation and DNA energy.This suggests that certain redundancy might exist between nucleosome occupation and DNA energy, which is not entirely surprising since a DNA region that is not within a nucleosome is likely to need less energy to destabilize the two strands than DNA within a nucleosome.This motivates us to group the four information sources into two categories, where group 1 includes evolutionary conservation and regulatory potential, and group 2 includes nucleosome locations and DNA duplex stability.Our results indicate that when a pair of data sources come from different groups, that is, have little redundancy, their joint performance can be better than those of their corresponding single data sources.Moreover, the best performance is achieved with a pair of additional data sources (supplementary Figure S5(b)), and adding more information sources into this pair cannot further improve the accuracy.The above results and analysis suggest that combining data sources that are redundant does not necessarily improve the overall performance.In other words, in order to gain a better prediction accuracy it is better to combine data sources that provide information from different perspectives of the same biological system.
Results for all four triplets of data sources are shown in supplementary Figure S6, which all perform better than their corresponding single data sources.It is seen that the best result is obtained by combining conservation, regulatory potential and nucleosome positioning, which accords well with our expectation since "conservation and regulatory potential" is the most informative pair in the lower false positive region (supplementary Figure S5(f)), and "nucleosome positioning, and regulatory potential" forms the best pair with respect to higher false positive region (supplementary Figure S5(d)).
Supplementary Figure S7 shows the ROC curve for the only quartet.Although one could expect that adding more information sources into TF target gene prediction always improves the prediction accuracy, our results show that it is not always the case.This finding is understood by realizing the difficulty of combining complex and poorly characterized genome-level data sources into TF target gene prediction.

Conclusions
We have three main contributions in this paper.Firstly, we have developed a new data integration method for TF target gene prediction from multiple data sources.The new method is compared with the one employed in [12] using a TF target gene prediction algorithm called ProbTF [12], and the results show that the new data fusion principle improves the previous method by lower false positive rate.Secondly, we have demonstrated the use of two novel information sources, DNA duplex stability and raw nucleosome occupancy predictions from a method proposed in [2], to guide TF target gene predictions.Our results show that both nucleosome occupancy and DNA stability data can improve TF target gene prediction accuracy especially when combined with evolutionary conservation or with conservation and regulatory potential.Moreover, more accurate nucleosome predictions result in better TF target gene predictions.It is also worth noticing that we do not distinguish different TFs regarding data source usage except for DNA duplex stabilities, where double or single strand binding proteins are treated differently and a heuristic method is adopted to classify them.Thirdly, we have compared all the possible combinations among conservation, regulatory potential, nucleosome positioning and DNA stability, whose results can be availed in data source selection or preparation when dealing with data integration problem in a particular application.We grouped the four tested information sources into two categories based on biological arguments: group 1 contains conservation and regulatory potential, and group 2 consists of nucleosome locations and DNA duplex stability.We found that combining data from different groups is more likely to improve TF target gene predictions presumably because data sources between the two groups are less redundant.
Although the assumption that all TFs bind to DNA in double-strand manner works well in yeast [11], it may not be sufficient in higher organisms, such as mouse, as shown in this study (see, e.g., Figure 3(a)).Instead, we obtained informative DNA duplex stability prior by assuming different binding preferences for different TFs.We constructed the binding preference of each TF with a simple heuristic which assesses the binding preference for a TF from a set of known binding sites.We have used cross-validation and an additional base pair shifting simulations to show that binding preference prediction and parameter optimization do not result in any (optimistic) bias, or overfitting, in binding prediction accuracy.However, the use of the DNA duplex stability data is limited because little verified information about TF binding specificities can be found from the literature and, therefore, binding specificities need to be learned from the data as well which currently requires that a set of verified binding sites is known.Future research goals include to develop an (unsupervised) algorithm for predicting the binding preference for TFs without prior knowledge of the known binding sites.Moreover, it is possible that one TF may have multiple folding modes, and can bind different sequences with different patterns.For example, MyoD, a member of helix-loop-helix protein family, can not only recognize the double-stranded DNA-binding site (called Ebox) in many muscle and nonmuscle genes, but also bind to the noncoding strand of an E-box from the muscle-specific creatine kinase enhancer in a single-stranded manner [25].To take this possibility into account, a more sophisticated assumption can be applied; that is, TFs can have different binding preferences to different sequences or under different experimental conditions.In this direction, we can also try to incorporate other data sources, such as ChIP-chip data, into our data fusion framework.
Nucleosome positioning data is employed in this study assuming that nucleosomes compete with DNA binding proteins [1] for target DNA binding sites.Although this assumption is generally true, we could not exclude the possibility that some TFs may selectively bind to nucleosomeoccupied regions.Binding sites of such TFs, if exist, can not be recognized by the method presented here when employing nucleosome occupancy data, but can be rescued, for example, by incorporating other information sources.

Figure 1 :
Figure 1: Illustration of data fusion problem in TF target gene prediction.The promoter sequence names are shown above the arrow, and the arrow corresponds to transcription start site (TSS).Horizontal axis corresponds to position relative to TSS.The red bar(s) together with a TF name on the first line of each figure represent the known binding site.For a given TF, data shown in grey (named with TRANSFAC IDs) represent models corresponding to different position-specific frequency matrices (PSFM) that are found from the TRANSFAC database.Evolutionary conservation (green), regulatory potential (blue), two nucleosome positioning signals[1,2] (magenta), and DNA duplex stability data (red) are shown in the following five rows (abbreviated with con., reg., npy., nuc.and DNA., resp.).The joint prior from all the four additional data sources (black) is shown in the last row.TFs shown in panels (a) and (b) are assumed to bind to their corresponding sequences in a double-strand manner, while TFs in panels (c) and (d) bind in a single-strand manner.All plotted data are for mouse genome.

P ri o r 1 P r io r 2 Figure 2 :
Figure2: An illustration of the prior integration method.An illustration of the prior integration method for the case of two additional data sources.x and y axes correspond to the two data sources and z-axis corresponds to the integrated prior.

Figure 3 :
Figure 3: CDFs of novel information sources at known TFBSs and random sites.CDFs of (a) DNA duplex destabilization energies at TFBSs of single-strand, double-strand binding TFs, and random DNA sites, (b) nucleosome occupation scores at known TFBSs and random DNA sites.Panels (c) and (d) are similar with (a) and (b), respectively, but with each information scores shifted 100 bps.

Figure 4 :
Figure 4: ROC curves for incorporating DNA duplex stability data.ROC curves of the estimated binding probabilities for the proposed data fusion method when combined with (a) DNA duplex stability data, (b) DNA duplex stability data and regulatory potential, and (c) DNA duplex stability data and evolutionary conservation data.

Table 1 :
AUC scores and scaling parameters for all data sources and their combinations.Data source combinations from 0 to 4 information sources are colored grey, green, blue, yellow, and magenta, respectively."a" and "b" are the multiplicative factor and bias term, respectively, for scaling each additional data source, and "c" is the scaling parameter used for combining multiple information sources into the TF target gene prediction framework.All the parameters shown here are selected with respect to the largest AUC scores.

Table 3 :
Transcription factors used in this study."1" and "2" each represents that the corresponding TF binds to DNA in a single and double strand manner, respectively.Empty blank means no literature information is found.