- Research Article
- Open Access
Novel Data Fusion Method and Exploration of Multiple Information Sources for Transcription Factor Target Gene Prediction
EURASIP Journal on Advances in Signal Processing volume 2010, Article number: 235795 (2010)
Background. Revealing protein-DNA interactions is a key problem in understanding transcriptional regulation at the mechanistic level. Computational methods have an important role in predicting transcription factor target genes genomewide. Multiple data fusion provides a natural way to improve transcription factor target gene predictions because sequence specificities alone are not sufficient to accurately predict transcription factor binding sites. Methods. Here we develop a new data fusion method to combine multiple genome-level data sources and study the extent to which DNA duplex stability and nucleosome positioning information, either alone or in combination with other data sources, can improve the prediction of transcription factor target genes. Results. Results on a carefully constructed test set of verified binding sites in the mouse genome demonstrate that our new multiple data fusion method can reduce false positive rates, and that DNA duplex stability and nucleosome occupation data can improve the accuracy of transcription factor target gene predictions, especially when combined with other genome-level data sources. Cross-validation and other randomization tests confirm the predictive performance of our method. Our results also show that nonredundant data sources provide the most efficient data fusion.
A central problem in molecular systems biology is to understand the manner in which a cell operates its complex transcriptional machinery. At the molecular level, transcriptional processes are largely controlled by transcription factors (TFs) that bind to gene promoters in a sequence-specific manner and thereby inhibit or promote the expression of their target genes. Collectively, these DNA-binding proteins and other molecules work together to implement the complex regulatory machinery that controls gene expression. Since large-scale understanding of transcriptional regulation is still severely limited even in lower organisms, it is of great importance to reveal these regulatory protein-DNA interactions genomewide.
Experimentally verified TF-binding sites (TFBSs) have been collected in databases [3–5], and recently developed experimental methods, such as ChIP-chip or ChIP-seq, are capable of measuring in vivo TFBSs in a high-throughput manner. However, it is not possible to obtain sufficient coverage, that is, to screen all TFs under all conditions, using experimental methods alone. Therefore, the binding site prediction problem calls for computational methods. Computational predictions rely on sequence specificities that are typically taken from a database or obtained as an output from a motif discovery method. Recent progress on the experimental side has also made it possible to measure TF-binding specificities in a high-throughput manner. The advent of these experimental techniques equips TF target gene prediction methods with much more accurate binding specificity models and, indeed, opens a whole new avenue for computational analysis of TF-DNA binding.
Sequence specificities alone, however, are not sufficiently informative to accurately predict TFBSs, simply because the probability of observing an exact copy of a presumably functional binding motif in a genome by chance is remarkably high. A natural way to improve TF target gene predictions is to incorporate additional information into statistical inference of TFBSs. A number of additional data sources can be useful for this purpose, including information on coregulated genes, evolutionary conservation, physical binding locations as measured by ChIP-chip or ChIP-seq, nucleosome occupancies, CpG islands, regulatory potential, and DNase hypersensitive sites. Incorporating additional information sources to guide statistical inference has been used successfully in the context of motif discovery [8–11], but has not attracted enough attention in TF target gene prediction. We have recently developed a probabilistic TF target gene prediction method, ProbTF, which can incorporate practically any additional genome-level information source to predict TF target genes.
Statistical data fusion for TF target gene prediction becomes more challenging in the case of multiple information sources. Here we develop a new method for multiple data fusion and incorporate novel data sources into TF target gene prediction. Four additional genome-level information sources (i.e., information at the level of individual nucleotides) are employed here to improve TF target gene prediction: evolutionary conservation, nucleosome positioning data from a recently published computational method, regulatory potential, and DNA duplex stability. Each of them is expected to be informative of binding sites, as will be discussed shortly. Some of these and other individual data sources have already been shown to improve de novo motif discovery [8–11]. Here we demonstrate how multiple data sources can be combined to make joint statistical inference of TF target genes. Integration of data sources that have a probabilistic interpretation is relatively straightforward, and for other data sources we convert the raw data into probabilities, or prior distributions, by extending a previously proposed Bayesian transformation method. In addition, for efficient use of DNA duplex stability data, we propose a simple heuristic that can assess the binding preference (single- versus double-stranded DNA) of a TF from a set of known binding sites. Results on a carefully constructed set of verified binding sites in the mouse genome [3, 5, 12] demonstrate that the new data fusion method that we propose here improves the performance of TF target gene prediction methods. We also demonstrate that a number of genome-level data sources, either alone or especially in combination, are highly informative of TF target genes. Consequently, our statistical data fusion method can provide valuable new insights into genomewide models of transcriptional regulatory networks.
Given the fundamental role of TFs in transcriptional regulation, we focus on predicting TF target genes. Because each individual data source is noisy and gives only a partial view of the underlying regulatory mechanisms, we focus on making statistical inference for TFBSs from multiple information sources. The essence of the data fusion problem that we encounter is illustrated in Figure 1, which shows four examples of verified binding sites from the test data set together with the associated additional genome-level data sources. The first row in each subplot shows the annotated binding site(s) for a TF in a gene promoter. The next rows (named by their TRANSFAC IDs, grey) show the log-likelihood scores of the position specific frequency matrix (PSFM) models relative to the Markovian background model. The following five rows show the additional data sources: probability of conservation (con., green), regulatory potential (reg., blue), nucleosome positioning signals predicted by two different methods (npy. and nuc., magenta), and DNA duplex stability (DNA. [15, 16], red) score for each position of the sequences. The joint prior combined from all the explored additional data sources is shown in black in the last row. The median and mean of the scores for each data type applied to the sequences shown in Figure 1 are recorded in a table in the supplementary material available online at doi: 10.1155/2010/235795.
Figure 1 shows that the highest log-likelihood score is not always obtained at the annotated binding site. TFs are commonly associated with multiple PSFMs since one TF may allow certain variation in its binding motif. Thus, it can be difficult to combine predictions from multiple PSFMs, given that these PSFMs may be very similar or very different. This issue can be solved by, for example, the ProbTF method, which implements an intuitive way of combining predictions from multiple PSFMs: ProbTF considers all possible numbers of nonoverlapping TFBSs in all possible locations and configurations and weights each configuration according to its probability. A more difficult problem is to decide which of the peaks predicted by PSFMs correspond to real, functional binding sites. As illustrated in Figure 1, the PSFM-based profiles have relatively good sensitivity but poor specificity, which is common to many PSFMs. Specificity can be greatly improved by genome-level data fusion, which forms the focus of this study.
Consistent with what is known about transcriptional regulation, many of the verified binding sites have a high degree of conservation and high regulatory potential scores and are typically free of stable nucleosomes (i.e., have low nucleosome occupancy scores). Moreover, DNA double helix destabilization energies at TF binding sites are different from those at random sites. In particular, TFBSs tend to have a high DNA duplex stability score if a TF prefers to bind both strands of the promoter sequence (Figures 1(a) and 1(b)) and a low DNA stability score in the opposite case (Figures 1(c) and 1(d)). The above reasoning seems to provide a simple logic for filtering the real TFBSs.
However, correlation between TFBSs and any of the additional data sources cannot be expected to be perfect even from a biological point of view. For example, only about 50% of functional binding sites are assessed to be evolutionarily conserved. The additional information sources are also noisy, regardless of whether they are experimental measurements or computational predictions. The only possibility is to make statistical inference, which takes the inherent randomness into account, from multiple genome-level data sources. The rationale is that the accuracy of computational TF target gene predictions naturally improves when more (useful) information is incorporated into statistical analysis.
2.1. Probabilistic Framework for TF Target Gene Prediction
We first describe the TF target gene prediction algorithm employed in this study (full details can be found from ). Let denote a single strand of a promoter sequence, where and is the length of the sequence (generalization to double-stranded DNA sequences is also possible but omitted here). Let denote the number of (unknown) binding sites and the (hidden) start positions of nonoverlapping binding sites in sequence ; that is, if then .
Nonbinding site (i.e., background) sequence locations are modeled by the order Markovian background model . Assuming that we have access to the previous nucleotides before the start of the actual sequence , the likelihood of a sequence having no binding sites for any TF is , where . We set since that value provides the best results in . TFBSs are modeled with the standard PSFM model which is a product of independent multinomial distributions. Let denote the probability of observing nucleotide at the position of , where is the length of the motif. Assume a TF is characterized by PSFMs, . Define as the configuration of motif models from in ; that is, specifies the motif model , which begins from location and has a length . Further, the probability of sequence , given nonoverlapping motif positions and the motif and background models, is
The probability that a sequence has binding sites is obtained with Bayes' rule
where the normalization factor is and is the maximum number of nonoverlapping motifs in an -length sequence. As proposed in , the prior of the number of motif instances, , is assumed to be independent of and and has an exponential form
where . We use . This formula defines the (user definable) prior expectation of the number of binding sites in a given DNA sequence. Importantly, it does not incorporate any of the informative data sources studied here. This prior primarily only raises or lowers the overall level of the estimated binding probabilities and, as such, has little effect on, for example, the ROC curves. The probability is obtained with the assumption that, for a fixed value of , the prior over binding site positions and configurations is uniform and inversely proportional to the number of different binding site positions and configurations. The probability is obtained by summing over all possible positions and configurations, and can be computed efficiently using a recursive formula .
Finally, the probability that a TF which is characterized by binds to a promoter , , is defined as the probability that at least one of the motif models in has a binding site in
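The computation described in this subsection can be sketched in code. The following is a minimal illustration, not the ProbTF implementation: it assumes that motif-versus-background likelihood ratios have already been computed for every motif model and start position, and it takes the prior over the number of sites as an input. All names (`site_count_likelihoods`, `binding_probability`, `ratios`, `prior`) are hypothetical. The recursion sums over every placement of nonoverlapping sites, divides by the number of placements (the uniform prior over positions and configurations), and then reports one minus the posterior probability of zero sites.

```python
from typing import List, Sequence


def site_count_likelihoods(ratios: Sequence[Sequence[float]],
                           motif_lengths: Sequence[int],
                           seq_len: int,
                           max_sites: int) -> List[float]:
    """Return P(sequence | c sites) / P(sequence | background) for c = 0..max_sites.

    ratios[m][i] is the likelihood ratio (motif m versus background) of a
    site of motif m starting at 0-based position i.
    """
    # weight[c][n]: sum, over all placements of c nonoverlapping sites inside
    # the first n positions, of the product of the site likelihood ratios.
    # count[c][n]: number of such placements (same recursion, ratios set to 1).
    weight = [[0.0] * (seq_len + 1) for _ in range(max_sites + 1)]
    count = [[0] * (seq_len + 1) for _ in range(max_sites + 1)]
    for n in range(seq_len + 1):
        weight[0][n] = 1.0
        count[0][n] = 1
    for c in range(1, max_sites + 1):
        for n in range(1, seq_len + 1):
            # Case 1: position n-1 is background.
            w = weight[c][n - 1]
            k = count[c][n - 1]
            # Case 2: a site of motif m ends exactly at position n-1.
            for m, length in enumerate(motif_lengths):
                start = n - length
                if start >= 0:
                    w += weight[c - 1][start] * ratios[m][start]
                    k += count[c - 1][start]
            weight[c][n] = w
            count[c][n] = k
    # Uniform prior over placements: divide by the number of placements.
    return [weight[c][seq_len] / count[c][seq_len] if count[c][seq_len] else 0.0
            for c in range(max_sites + 1)]


def binding_probability(likelihoods: Sequence[float],
                        prior: Sequence[float]) -> float:
    """Probability of at least one site: 1 - P(zero sites | sequence)."""
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    return 1.0 - joint[0] / sum(joint)


# Toy example: one motif of length 1 on a sequence of length 2.
toy_likelihoods = site_count_likelihoods([[2.0, 3.0]], [1], seq_len=2, max_sites=2)
toy_probability = binding_probability(toy_likelihoods, [1 / 3, 1 / 3, 1 / 3])
```

In the toy example the uniform prior over the number of sites is used only for illustration; the exponential prior described above would be passed in through `prior` instead. A production version would work in log space to avoid overflow on long promoters.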
Integration of additional data sources into the aforementioned probabilistic TF target gene prediction framework is carried out by assuming that the data sources are in the form of where is the probability that the base pair location is a binding site. can be derived from a single data source or from multiple data sources (see subsections "DNA duplex stability data", "Nucleosome occupation data", and "Data integration method" of this section for details). Assuming that and are conditionally independent and the probability of does not depend on the PSFM and background models, the probability of and given , , , and is
Following (1), the probability is modeled as
and, thus, the joint probability can be written compactly as
where and . Consequently, the same efficient recursive algorithm can be used to compute (see  for more details).
Note that the choice of Markovian and PSFM models is arbitrary. Also note that, since additional data are incorporated using probabilities of binding over the promoter sequence, we could also employ methods other than ProbTF.
2.2. Data Integration Method
Define the additional genome-level data source (for a single gene promoter having length ) as . Denote the probabilities for position from different data sources as , . Further, define a thresholded version of probabilities as
where is a threshold for the data source and is defined as a percentile of the distribution of the data source. Then the thresholded scores for position can be written as , . Let be the number of data sources that exceed their thresholds at location , then the integrated probability for position , , is calculated as
The data integration method is parameterized by and . Note that and . It is also worth noting that the resulting probabilities do not include hard thresholding for any of the genomic locations although thresholding is involved in integration, and the use of thresholding during the construction is motivated by its simple yet powerful parametrization.
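The first two steps of the integration, percentile thresholds and the per-position count of data sources that exceed them, can be sketched as follows. This is only an illustration under stated assumptions: the nearest-rank quantile, the strict "greater than" comparison, and the function names are hypothetical choices, and the final weighting step that turns the count into an integrated probability is deliberately not shown because it depends on the weighting parameters of the method.

```python
import math
from typing import List, Sequence


def percentile_threshold(values: Sequence[float], q: float) -> float:
    """Nearest-rank q-quantile (0 < q <= 1) of one data source."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def count_exceeding(sources: Sequence[Sequence[float]], q: float) -> List[int]:
    """For every position, count the data sources whose value exceeds
    that source's own percentile threshold."""
    thresholds = [percentile_threshold(s, q) for s in sources]
    n_positions = len(sources[0])
    return [sum(1 for s, t in zip(sources, thresholds) if s[i] > t)
            for i in range(n_positions)]


# Two toy per-position probability tracks over a promoter of length 4.
toy_sources = [[0.1, 0.2, 0.9, 0.8],
               [0.5, 0.1, 0.7, 0.2]]
toy_counts = count_exceeding(toy_sources, q=0.5)
```

Positions supported by several sources receive a higher count, which is the quantity the integration step uses to emphasize locations indicated by multiple data sources.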
The data integration method is illustrated in Figure 2 for the case of two additional data sources with parameters , , , and . For illustration purposes, both data sources are assumed to have uniform distribution and hence .
In the above genome-level data integration method there are ( is the number of additional data sources) weighting parameters , and one threshold for emphasizing the most informative binding locations. There are also two scaling parameters, a multiplicative factor , and a bias term , for each additional data source, and one scaling parameter, , for combining other data sources with the TF target gene prediction analysis. These parameters are used to scale the original probability values into a proper range. In particular, the scaling parameters are used in the following way (for the data source):
where is the probability that a DNA site is a TFBS () given the value of the raw data . For conservation and regulatory potential the original data are already in a probabilistic format, and for nucleosome and DNA stability data the conversion of the raw data into probabilities was described in the previous sections. Probability is the final integrated prior probability for position after scaling, which is directly used in further TFBS prediction as explained, for example, in (6) and (7).
All the parameters needed in this study were chosen by a grid search that optimizes receiver operating characteristic (ROC) curves. The importance of each data source is reflected by its multiplicative factor; that is, the higher the multiplicative factor, the less noisy or more important this type of data is. "1-specificity" (x axis) and "sensitivity" (y axis) are used to draw the ROC curves according to

$$1 - \text{specificity} = \frac{FP}{FP + TN}, \qquad \text{sensitivity} = \frac{TP}{TP + FN},$$

where TN, FP, TP, and FN stand for "true negative", "false positive", "true positive", and "false negative", respectively. In particular, TN, FP, TP, and FN are obtained by comparing the computed binding probabilities (of a sequence having a binding site for a TF) with known binding site information from the test data set, that is, "0" (no binding site) and "1" (at least one binding site). We used the area under the curve (AUC) and AUC30 (the AUC for the area between false positive rates ) to optimize the parameters. In the case of four additional data sources, we are dealing with a high-dimensional grid search problem. Since the grid size grows exponentially with the dimension, we resort to a heuristic where each parameter is optimized separately using a 1-dimensional grid search while keeping the other parameters fixed. Moreover, parameter optimization is done sequentially so that we first optimize parameters and for individual data sources. Scaling parameters are optimized similarly except that is always assigned to 1. For example, parameters and are optimized using two data sources, which are then kept fixed and assigned to and , respectively, when optimizing new parameter using three data sources, and so forth. In our study, we optimized the parameters of up to four data sources, which are , , , , and , respectively, and equals 0.93. It is worth noticing that the adjacent 's () tend to be similar for small values of , and especially we have when equals 4. This accords well with the main feature of our new data fusion method, which is to search for bona fide locations (indicated by several data sources) and reduce false positives by not paying too much attention to the locations indicated by fewer data sources. The remaining optimized scaling parameters are listed in Table 1.
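The evaluation criterion can be made concrete with a short sketch. The code below builds ROC points from binding probabilities and 0/1 labels and integrates them with the trapezoidal rule; it is an illustration rather than the evaluation code used in the study. The `max_fpr` argument is a hypothetical parameter for a partial AUC, and the value 0.3 in the example is only an assumption suggested by the name AUC30.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Sweep the decision threshold from high to low and return (fpr, tpr)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        tpr.append(tp / n_pos)  # sensitivity = TP / (TP + FN)
        fpr.append(fp / n_neg)  # 1 - specificity = FP / (FP + TN)
    return fpr, tpr


def auc(fpr: Sequence[float], tpr: Sequence[float], max_fpr: float = 1.0) -> float:
    """Trapezoidal area under the ROC curve for false positive rates up to max_fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        x0, x1, y0, y1 = fpr[i - 1], fpr[i], tpr[i - 1], tpr[i]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: four promoters, two of which contain a verified site.
toy_fpr, toy_tpr = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
toy_auc = auc(toy_fpr, toy_tpr)
toy_partial_auc = auc(toy_fpr, toy_tpr, max_fpr=0.3)
```

In a coordinate-wise grid search, a function like `auc` would be re-evaluated for every candidate value of one parameter while the others are held fixed.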
The scaling parameters (the multiplicative factor, the bias term, and the scaling parameter used for combining the additional data with the TF target gene prediction analysis) are relatively robust: slight variations in them do not dramatically affect the results. We varied the multiplicative factor of the DNA duplex stability data (for both double- and single-strand binding data), which is expected to have the larger effect on the results (recall that the multiplicative factor weights different information sources and reflects their importance), and listed the resulting AUC scores, for the single data source as well as for its combination with other additional information sources, in a supplementary table. With small changes of the multiplicative factor, the results do not vary significantly. However, small changes in the weighting parameters and the threshold may have a greater effect on the results, since they determine how different data sources are combined. This can be seen from the close values among the weighting parameters, which are 0.72, 0.72, and 0.73, respectively. These parameters depend heavily on the quality and type of data, and should be optimized before data integration.
2.3. DNA Duplex Stability Data
The DNA stability measures the amount of energy needed to separate the two strands of DNA. In this study the DNA destabilization energies were obtained from the online tool WebSIDD [15, 16], where the parameters were set to "DNA Type: circular", "Energetic Type: near neighbor", and "Energy Cutoff: level 4". Note that the circular setting is used here to calculate the duplex stabilities of linear DNA. This is because WebSIDD handles linear DNA in the same way as circular DNA except that it adds 50 G/C to the end, which is not needed here given the extended DNA sequences used. We obtained the energy score for each sequence with a 1 kb extension at both ends. For every binding site we computed the energy of destabilization as the average of the destabilization values for all positions within the site.
2.3.1. Assessing Binding Preference for Each TF
Relatively little is reported in the literature about specific types of protein-DNA interactions, and protein domain annotations are not available for all TFs. We therefore decided to assess the binding preference of each factor simply by looking at the DNA stability scores at the known binding sites in the test data set. Assuming that a TF has the same binding preference at all of its binding sites, we estimated the binding preference of each TF with the following heuristic. Let denote the set of all known start binding positions of a TF among all the tested sequences in our test set. For all the known binding sites in , we compute counts and which are the number of times and , respectively, where is the width of the verified binding site in the test set and is the threshold specified by quantile . Then, the TF is assigned to bind in a double-strand manner if , in a single-strand manner if , and in cases , random preference is assigned. In order to make the above heuristic more robust, we repeated it for three thresholds specified by different quantiles with both raw and smoothed DNA duplex stability scores. The final binding preference of each TF is determined by taking a vote among these six binding preferences, and again in case of a tie a random binding preference is assigned.
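A minimal sketch of this voting heuristic is given below, under explicit assumptions. It assumes that each known site is summarized by its average stability score, that a site counts toward the double-strand preference when this average lies above the threshold and toward the single-strand preference when it lies below (in line with the expectation that double-strand binders sit in regions of high duplex stability), and that ties are reported as `"tie"` so that the caller can assign a random preference. The function names and the threshold values in the example are hypothetical.

```python
from typing import Sequence


def vote_for_threshold(site_means: Sequence[float], threshold: float) -> str:
    """One vote: compare each site's mean stability score with a threshold."""
    above = sum(1 for v in site_means if v > threshold)
    below = sum(1 for v in site_means if v < threshold)
    if above > below:
        return "double"
    if above < below:
        return "single"
    return "tie"


def final_preference(votes: Sequence[str]) -> str:
    """Majority vote over the individual votes; a tie is left to the caller."""
    n_double = sum(1 for v in votes if v == "double")
    n_single = sum(1 for v in votes if v == "single")
    if n_double > n_single:
        return "double"
    if n_double < n_single:
        return "single"
    return "tie"


# Toy example: mean scores of three known sites of one TF, from raw and
# smoothed stability tracks, and three hypothetical quantile thresholds.
raw_means = [5.1, 6.0, 1.2]
smoothed_means = [4.8, 5.5, 2.0]
thresholds = [2.5, 3.0, 3.5]
toy_votes = [vote_for_threshold(means, t)
             for means in (raw_means, smoothed_means)
             for t in thresholds]
toy_preference = final_preference(toy_votes)
```

Using two score tracks and three thresholds yields the six votes mentioned above; the thresholds themselves would come from quantiles of the stability scores rather than being fixed by hand.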
2.3.2. Construction of DNA Duplex Stability Prior
We built three data sets to construct the DNA duplex stability priors: one positive single-strand binding data set, one positive double-strand binding data set, and one background data set. The positive data sets are constructed from the 226 known binding sites in our test data set by splitting the known binding sites into single- and double-strand binding sets according to the binding preference of each TF. The background data set is generated as follows. For each verified binding site in our test set, we randomly select 20 genomic locations (from the same promoter sequence), each with the average binding site length of 12, which results in a background set that is 20 times larger than the test set.
The raw DNA duplex stability scores are converted into probabilities using a previously proposed method, extended to account for both single- and double-strand binding preferences. For each data set, we built a histogram of the energies, then normalized and smoothed the values to get a probability distribution. The cumulative distribution functions (CDFs) of the three data sets are shown in Figure 3(a), which indicates that DNA duplex stability data does provide discriminative information about TFBSs. All known binding sites, on which the performance is eventually evaluated, are used to draw Figure 3(a), which introduces a risk of circular reasoning. However, our cross-validation and randomization simulations show that this biasing effect is negligible. For every energy value and binding site , the conditional density of the single- and double-strand binding data are and , respectively, where and denote single- and double-strand TFBSs, respectively. Similarly, for the random genomic locations we have . We also estimated the frequency of the randomly chosen DNA sites that have a significant overlap with any of the known single-strand and double-strand binding sites, and , respectively. Bayes' rule is used to compute the probability that a DNA site is a single-strand TFBS given its energy (similar calculation is also applied to the double-strand case)
2.4. Nucleosome Occupation Data
2.4.1. Construction of Nucleosome Occupation Prior
We built the nucleosome occupation prior in the same way as for the DNA stability data, but with only two data sets: positive and background. The positive data set consists of the averaged -scores (the raw nucleosome occupancy scores obtained using the nucleosome prediction method) of the known binding positions. The background data set is composed of the averaged -scores of randomly selected genomic locations in the same way as above. For every occupation score , the conditional probabilities for binding and nonbinding sites are denoted as and , respectively. The CDFs of the two nucleosome data sets are shown in Figure 3(b), which indicates that the nucleosome positioning information is informative of TFBSs. The probability that a DNA site is a TFBS given its nucleosome occupation score is obtained by (13) (with replaced by ). Note that .
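The two-class version of this Bayesian transformation can be sketched as follows. The code is an illustration only: it uses the textbook two-class form of Bayes' rule, which may differ in detail from equation (13); the add-one smoothing stands in for the unspecified smoothing step; and the function names, bin edges, and prior value are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def bin_index(x: float, edges: Sequence[float]) -> int:
    """Index of the histogram bin containing x, clamped to the valid range."""
    return min(max(bisect_right(edges, x) - 1, 0), len(edges) - 2)


def smoothed_histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Normalized histogram with add-one smoothing (an assumed smoothing choice)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[bin_index(v, edges)] += 1
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]


def posterior_binding(x: float, edges: Sequence[float],
                      positive: Sequence[float], background: Sequence[float],
                      prior: float) -> float:
    """P(site | score x) from the two class-conditional histograms and a prior."""
    b = bin_index(x, edges)
    numerator = positive[b] * prior
    return numerator / (numerator + background[b] * (1.0 - prior))


# Toy example with two bins; the prior of 0.5 is chosen only for illustration.
toy_edges = [0.0, 1.0, 2.0]
toy_positive = smoothed_histogram([0.5, 0.5, 1.5], toy_edges)
toy_background = smoothed_histogram([1.5, 1.5, 1.5, 0.5], toy_edges)
toy_posterior = posterior_binding(0.5, toy_edges, toy_positive, toy_background, prior=0.5)
```

Because both histograms share the same bin edges, bin probabilities can be used in place of densities: the bin width cancels in the ratio.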
We validate our computational methods using the same mouse data set as in our previous study, which consists of 47 promoter sequences (as shown in Table 2), each with a varying number of annotated binding sites from the ABS and ORegAnno databases (the annotated binding sites are also listed in Table 2). Sequence lengths are either 2 kbp or vary around 500 bp. PSFM models are taken from TRANSFAC (professional version 10.2). The additional data sources used here are conservation, regulatory potential, DNA duplex stability, and nucleosome positioning. The first two data sources are the same as those used previously, where conservation is assessed with the PhastCons scores and regulatory potential is constructed from a set of known regulatory and nonregulatory sequences using a discriminatory computational analysis (the prediction algorithm is named "ESPERR"). DNA duplex stability and nucleosome positioning are the two new data sources explored in more detail in this study. We use our computational methods to predict whether the promoter of a gene has TFBS(s) or not.
3. Results and Discussion
In this section, we report and discuss, in turn, the exploration of two novel additional data sources, the evaluation of the new data fusion method, and a comparison among different data source combinations in TF target gene prediction. The idea of our computational methods is to probabilistically bias the search for binding sites toward those genomic locations that are more likely to contain binding site(s) in light of the additional data. The quality of the TF target gene prediction results is evaluated by the ROC curves and the histograms of the estimated binding probabilities, which are drawn from the probabilities over all the TFs and the sequences being analyzed. The test data set used throughout this paper consists of 47 promoter sequences, each containing a varying number of annotated binding sites from the ABS and ORegAnno databases.
3.1. Novel Informative Data Sources
3.1.1. DNA Duplex Stability Prior
Most sequence-specific DNA binding proteins contact the major groove of double-stranded DNA in the B conformation, and some TFs are shown to bind DNA in a double-strand manner according to their crystal structures. Thus, the DNA destabilization energies at the protein binding sites of these TFs are expected to be high. This assumption has been verified in yeast in a study on improving the accuracy of TFBS discovery, which is a different topic from TF target gene prediction. On the other hand, during transcription, the two DNA strands must be separated to let RNA polymerase slide along the DNA molecule and synthesize a nascent mRNA. Since the binding sites for many general TFs are located in the proximal promoter regions of the transcribed gene, the DNA double helix in these regions is expected to have low duplex stability. In addition, there is experimental evidence showing that some regulatory proteins bind to DNA in a single-strand manner [21, 22]. Taken together, these observations suggest that DNA duplex stability data should be informative of binding sites; whether a lower or higher DNA duplex stability at specific TF binding sites is preferable depends largely on the binding preference of the TF, that is, whether the TF binds to the DNA in a double- or single-strand manner. In our study, we assume that TFBSs for TFs with single-strand binding preference occur preferentially in regions with low DNA duplex stability, and the other way around for double-strand binding TFs.
In the TF target gene prediction analysis, the raw DNA duplex destabilization energies were converted into probability values using a Bayesian transformation method, and each TF's binding preference is predicted with a heuristic method (see Section 2 for details).
From the ROC curves shown in Figure 4(a) and supplementary Figure (a) we can see that DNA duplex stability alone can slightly improve the TF target gene prediction accuracy, and its performance can be remarkably improved by combining with other priors, such as conservation (Figure 4(c) and supplementary Figure (g)) or regulatory potential (Figure 4(b) and supplementary Figure (d)). Table 1 also demonstrates that the AUC scores for combining DNA energy with conservation or regulatory potential are higher than those obtained with single additional information sources. These results indicate that DNA duplex stability data has the potential of improving TF target gene prediction depending on how and which data sources it is combined with. Further, DNA duplex stabilities are expected to be more informative in TF target gene prediction if they are obtained experimentally.
Out of the 23 TFs whose PSFMs are known and studied here, nine are predicted to bind sequences in a single-strand manner and 14 in a double-strand manner. Information such as the names and binding promoters (in the mouse genome) of these 23 TFs is listed in Tables 2 and 3, with more detailed information available from http://www.probtf.org/. Also shown in Table 2 are the DNA duplex stability scores for all the binding sites in all the promoter sequences used in this paper, each of which is averaged over all the raw stability scores of a TFBS. These TFs include all the (mouse) TFs whose binding site information can be downloaded from the ABS or ORegAnno databases and whose binding specificity model(s) can be found in the TRANSFAC database (professional version 10.2). It is seen from Table 3 that, for the six TFs whose binding preferences are known, our predicted binding preferences accord well with the literature-derived information. In order to avoid the possible bias that could be introduced when the binding preference of each TF is predicted from the same data that is used for validation, we also performed standard leave-one-out cross-validation on the binding preference prediction. No significant differences were observed in these results. Thus, our binding prediction method, when integrated with DNA duplex stability data, should have similarly good predictive performance outside our test data set as well.
3.1.2. Nucleosome Occupation Prior
Chromatin structure has an important role in regulating the transcriptional machinery. At the genome level, these mechanisms are controlled by the basic structural subunits, nucleosomes, which can limit the access of TFs to their binding sites [1, 17]. Thus, from the viewpoint of computational TFBS prediction, the likelihood of a TF binding to nonfunctional sites can be decreased by locating a stable nucleosome over those genomic regions while keeping functional sites accessible for TF binding. This assumption is supported, for example, by the fact that the binding of SP1, GAL4, and USF to nucleosome cores requires other proteins such as nucleoplasmin to remove H2A and H2B, which consequently results in nucleosome disassembly, and by the evidence that the binding propensity of glucocorticoid receptor (GR) to the nucleosome core is much lower than that to the nucleosome-free sequence. However, the possibility that some TFs bind equally well or even better to sequences occupied by nucleosomes than to nucleosome-free regions cannot be excluded. In such cases nucleosome location data alone will not be sufficient, and multiple data sources may be used to improve the prediction accuracy.
High-resolution genomewide nucleosome positioning data exist for organisms such as yeast and human, but in the case of mouse, we currently need to rely on computational predictions. Indeed, this computational prediction problem has attracted considerable interest, and improved methods have been proposed recently. The ProbTF method was previously tested with predicted nucleosome locations from Segal's original model, which relies on dinucleotide frequencies, and the nucleosome data was not found to be informative of binding sites. Here we explore whether more recent and more accurate nucleosome positioning data, together with a novel data fusion method, can improve TF target gene prediction. In this study, we used a recently developed computational multiresolution method to predict the nucleosome locations for all the 47 tested sequences. We decided to use the raw nucleosome positioning data, that is, without the hidden Markov model (HMM) processing, and to employ the extended sequences to obtain the -score for each genomic location. The raw data were further converted into probabilities using a Bayesian transformation method (for details see Section 2).
We compared the two nucleosome data sets by integrating them separately into our TF target gene prediction algorithm. It is particularly promising to see that the use of the more accurate nucleosome positioning data from the multiresolution method results in more accurate TF target gene prediction, as shown in supplementary Figure (a). As in the case of DNA duplex stability data, we combined nucleosome data with conservation (supplementary Figure (e)) or regulatory potential data (supplementary Figure (c)), and the combined data again improve the TF target gene predictions. For example, the AUC score of 0.7555 obtained with nucleosome data alone increases to 0.7946 when combined with regulatory potential, and to 0.8334 when combined with conservation.
To gain insight into each individual data source and to assess the extent of possible overfitting stemming from parameter optimization, we also prepared an additional control simulation. We shifted each additional data source by 100 base pair positions and then applied our computational methods as explained above, including binding preference prediction and optimization of parameters, to test the performance of randomized data. ROC curves corresponding to the four shifted information sources are shown in supplementary Figure and the AUC scores after shifting for each data source are recorded in Table 1. For the two novel data sources, we also compared their CDFs after shifting with the original ones, as shown in Figure 3. The Kolmogorov-Smirnov statistic (KS statistic) for the CDFs of DNA duplex stability scores at random sites and double strand binding sites is 0.3097, and that for random sites and single strand binding sites is 0.3641 (as depicted in Figure 3(a)). After shifting, however, the KS statistics between random locations and double strand binding sites and between random sites and single strand binding sites become 0.1905 and 0.1168, respectively (see Figure 3(c)). Similarly, the KS statistic between the CDFs of nucleosome positioning data at random sites and nucleosome binding sites is 0.1699 (Figure 3(b)) and drops to 0.0379 after shifting (Figure 3(d)). We also measured the Kullback-Leibler divergence (KL divergence) between each density pair. The KL divergences between the PDFs of DNA duplex stability scores at random sites and at double and single strand binding sites are 0.1868 and 0.6617, respectively (Figure 3(a)), and decrease to 0.1037 and 0.1065, respectively, after shifting (Figure 3(c)). Likewise, the KL divergence between the PDFs of nucleosome positioning data at random sites and nucleosome occupied sites drops from 0.1830 to 0.0330 after shifting, as represented by Figures 3(b) and 3(d), respectively.
These results show that no information is gained from the shifted data sources. Taken together with the cross-validation results shown above, this demonstrates that the improved binding prediction accuracy is not an artifact of overfitting.
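The shifting control can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the score track, the site positions, the bump size, the histogram binning, and the direction of the KL divergence are hypothetical and are not taken from the study. The sketch computes a two-sample KS statistic and a histogram-based KL divergence between scores at background positions and scores at sites, first at the true site positions and then at positions shifted by 100 bp.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def kl_divergence(a, b, bins=10, pseudo=1.0):
    """KL(P_a || P_b) in nats, from histograms over shared equal-width bins."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    width = (hi - lo) / bins or 1.0

    def hist(xs):
        counts = [pseudo] * bins  # pseudocounts keep every bin nonzero
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        total = sum(counts)
        return [c / total for c in counts]

    return sum(p * math.log(p / q) for p, q in zip(hist(a), hist(b)))


# Hypothetical per-base score track: Gaussian noise plus a bump around each site.
random.seed(0)
length, shift = 20000, 100
track = [random.gauss(0.0, 1.0) for _ in range(length)]
sites = random.sample(range(200, length - 200), 150)
for s in sites:
    for j in range(s - 5, s + 6):
        track[j] += 1.5

background = [track[i] for i in random.sample(range(200, length - 200), 1000)]
at_sites = [track[s] for s in sites]
at_shifted = [track[s + shift] for s in sites]  # control: read the track 100 bp away

ks_original = ks_statistic(background, at_sites)
ks_shifted = ks_statistic(background, at_shifted)
kl_original = kl_divergence(at_sites, background)
kl_shifted = kl_divergence(at_shifted, background)
print(f"KS statistic: {ks_original:.3f} (original) vs {ks_shifted:.3f} (shifted)")
print(f"KL divergence: {kl_original:.3f} (original) vs {kl_shifted:.3f} (shifted)")
```

If a data source carries positional information about binding sites, both distances should shrink after shifting, which is the pattern reported above for the real data.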
We further compared the scaling parameter (see (11) in Section 2) obtained when integrating the different nucleosome data and the DNA duplex stability data into the TF target gene prediction framework. The parameter essentially determines the weight of each individual information source. As shown in Table 1, the parameter for nucleosome positioning data obtained from the multiresolution method is higher than that obtained with data from Segal's model, which is consistent with the results in supplementary Figure (a), where nucleosome data from the multiresolution method clearly provide more information than those from Segal's model. Similarly, the parameter for DNA duplex stability data for TFs with a single-strand binding pattern is higher than that for TFs with a double-strand binding pattern. This is again consistent with the results shown in Figure 3(a), where DNA stability energies of single-strand binding TFs provide much better discrimination than those of double-strand binding TFs. These results suggest that the scaling parameter is associated with data quality: a higher value indicates a more informative data source.
3.2. Multiple Data Fusion Method
We next briefly demonstrate the performance of the new data fusion method and compare it with that of a standard weighting-based scheme proposed previously. Qualitatively, the previous data fusion method is based on a type of averaging where a genomic location is suggested to contain a binding site only if a large majority of the additional data sources indicate a binding site, whereas the new method can assign more prior probability to a genomic location if it is indicated as a binding site by a few (or even a single) more informative data sources (see Section 2 for a detailed technical description of our data fusion methods).
The performance of the old and new data fusion methods is illustrated in supplementary Figure, which shows the ROC curves for finding the verified binding sites in the set of gene promoters using both evolutionary conservation and regulatory potential. Parameters in supplementary Figures (a) and (c) are chosen according to the whole AUC and the AUC30, respectively. Supplementary Figure (a) shows that the new method works better than the old one by generating a higher overall AUC, and supplementary Figure (c) demonstrates that the new method can improve the prediction accuracy especially in the low false positive rate (FPR) region, which is a highly preferable property in general.
Supplementary Figures (b) and (d) show the histograms of the predicted binding probabilities for both the old and new data fusion methods, where the parameters in Figures (b) and (d) are selected according to the whole AUC and AUC30, respectively. Histograms are drawn separately for negative and positive cases and, hence, these graphs clearly demonstrate how well the two methods are able to discriminate the target genes that contain known binding sites from nontarget genes that do not contain binding sites. From these graphs, we can see that the new method improves discrimination by assigning much smaller binding probabilities to sequences with no known binding sites (regardless of whether AUC or AUC30 is used), which results in a much smaller false positive rate. AUC scores for single data sources and for all combinations of multiple data sources are summarized in ascending order in Table 1, and their corresponding data fusion results are shown and discussed in the following sections.
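For readers less familiar with the two criteria, the following minimal sketch computes the whole AUC and a partial AUC restricted to a low-FPR region from predicted binding probabilities. The exact definition of AUC30 is given in Section 2; the cutoff `max_fpr=0.3` used here, the helper names `roc_points` and `auc`, and the toy scores and labels are illustrative assumptions, not the paper's definition or data.

```python
def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) points, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Process tied scores together so ties give one diagonal segment.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: 1 = promoter with a verified binding site, 0 = no known site.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
full_auc = auc(points)
low_fpr_auc = auc(points, max_fpr=0.3)
print(f"whole AUC = {full_auc:.4f}, partial AUC (FPR <= 0.3) = {low_fpr_auc:.4f}")
```

Selecting parameters by a partial AUC of this kind rewards methods that rank true targets ahead of the first few false positives, which is the regime emphasized in the comparison above.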
3.3. Comparison of Combinations of Information Sources
To better understand which combinations of additional genome-level data sources are most informative of TFBSs, we compared the TF target gene prediction accuracy of all possible combinations of evolutionary conservation, regulatory potential, nucleosome locations, and DNA duplex stability. The best combination is conservation and nucleosome positioning, whose results have already been shown in supplementary Figure (e).
Results for all six pairs of data sources are reported in supplementary Figure, which shows that most combinations of two data sources work better than their corresponding single data sources, except for the combination of nucleosome occupation and DNA energy. This suggests that certain redundancy might exist between nucleosome occupation and DNA energy, which is not entirely surprising since a DNA region that is not within a nucleosome is likely to need less energy to destabilize the two strands than DNA within a nucleosome. This motivates us to group the four information sources into two categories, where group 1 includes evolutionary conservation and regulatory potential, and group 2 includes nucleosome locations and DNA duplex stability. Our results indicate that when a pair of data sources comes from different groups, that is, has little redundancy, their joint performance can be better than that of the corresponding single data sources. Moreover, the best performance is achieved with a pair of additional data sources (supplementary Figure (b)), and adding more information sources to this pair does not further improve the accuracy. The above results and analysis suggest that combining data sources that are redundant does not necessarily improve the overall performance. In other words, to gain better prediction accuracy it is better to combine data sources that provide information from different perspectives of the same biological system.
Results for all four triplets of data sources are shown in supplementary Figure, all of which perform better than their corresponding single data sources. The best result is obtained by combining conservation, regulatory potential, and nucleosome positioning, which accords well with our expectation since "conservation and regulatory potential" is the most informative pair in the lower false positive region (supplementary Figure (f)), and "nucleosome positioning and regulatory potential" forms the best pair with respect to the higher false positive region (supplementary Figure (d)).
Supplementary Figure shows the ROC curve for the only quartet. Although one could expect that adding more information sources into TF target gene prediction always improves the prediction accuracy, our results show that this is not always the case. This finding can be understood in light of the difficulty of combining complex and poorly characterized genome-level data sources in TF target gene prediction.
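The combinations compared in this section can be enumerated with a short sketch, which also marks the pairs that draw from both of the groups defined above. The dictionary `GROUP` and the helper names are hypothetical and serve only as bookkeeping; the sketch does not evaluate any predictions.

```python
from itertools import combinations

# Grouping used in the text: group 1 = conservation and regulatory potential,
# group 2 = nucleosome positioning and DNA duplex stability.
GROUP = {
    "conservation": 1,
    "regulatory potential": 1,
    "nucleosome positioning": 2,
    "DNA duplex stability": 2,
}


def subsets_by_size(sources):
    """Map each subset size k to the list of all k-element combinations."""
    return {k: list(combinations(sources, k)) for k in range(1, len(sources) + 1)}


def spans_both_groups(subset):
    """True if the subset contains at least one source from each group."""
    return len({GROUP[s] for s in subset}) > 1


subsets = subsets_by_size(list(GROUP))
cross_pairs = [pair for pair in subsets[2] if spans_both_groups(pair)]

for k, combos in subsets.items():
    print(f"{k} source(s): {len(combos)} combination(s)")
print("pairs spanning both groups:")
for pair in cross_pairs:
    print("  " + " + ".join(pair))
```

Four single sources, six pairs, four triplets, and one quartet give 15 configurations in total, and four of the six pairs combine one source from each group.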
This paper makes three main contributions. Firstly, we have developed a new data integration method for TF target gene prediction from multiple data sources. The new method is compared with the previously employed one using a TF target gene prediction algorithm called ProbTF, and the results show that the new data fusion principle improves on the previous method by lowering the false positive rate. Secondly, we have demonstrated the use of two novel information sources, DNA duplex stability and raw nucleosome occupancy predictions from a previously proposed method, to guide TF target gene predictions. Our results show that both nucleosome occupancy and DNA stability data can improve TF target gene prediction accuracy, especially when combined with evolutionary conservation or with conservation and regulatory potential. Moreover, more accurate nucleosome predictions result in better TF target gene predictions. It is also worth noting that we do not distinguish between TFs regarding data source usage except for DNA duplex stabilities, where double and single strand binding proteins are treated differently and a heuristic method is adopted to classify them. Thirdly, we have compared all possible combinations of conservation, regulatory potential, nucleosome positioning, and DNA stability, and the results can inform data source selection or preparation when dealing with a data integration problem in a particular application. We grouped the four tested information sources into two categories based on biological arguments: group 1 contains conservation and regulatory potential, and group 2 consists of nucleosome locations and DNA duplex stability. We found that combining data from different groups is more likely to improve TF target gene predictions, presumably because data sources from the two groups are less redundant.
Although the assumption that all TFs bind to DNA in a double-strand manner works well in yeast, it may not be sufficient in higher organisms, such as mouse, as shown in this study (see, e.g., Figure 3(a)). Instead, we obtained an informative DNA duplex stability prior by assuming different binding preferences for different TFs. We constructed the binding preference of each TF with a simple heuristic which assesses the binding preference for a TF from a set of known binding sites. We have used cross-validation and an additional base pair shifting simulation to show that binding preference prediction and parameter optimization do not result in any (optimistic) bias, or overfitting, in binding prediction accuracy. However, the use of the DNA duplex stability data is limited because little verified information about TF binding specificities can be found in the literature and, therefore, binding specificities need to be learned from the data as well, which currently requires that a set of verified binding sites is known. Future research goals include developing an (unsupervised) algorithm for predicting the binding preference of TFs without prior knowledge of the known binding sites. Moreover, it is possible that one TF may have multiple folding modes and can bind different sequences with different patterns. For example, MyoD, a member of the helix-loop-helix protein family, can not only recognize the double-stranded DNA-binding site (called E-box) in many muscle and nonmuscle genes, but also bind to the noncoding strand of an E-box from the muscle-specific creatine kinase enhancer in a single-stranded manner. To take this possibility into account, a more sophisticated assumption can be applied; that is, TFs can have different binding preferences for different sequences or under different experimental conditions. In this direction, we can also try to incorporate other data sources, such as ChIP-chip data, into our data fusion framework.
Nucleosome positioning data are employed in this study under the assumption that nucleosomes compete with DNA-binding proteins for target DNA binding sites. Although this assumption is generally true, we cannot exclude the possibility that some TFs may selectively bind to nucleosome-occupied regions. Binding sites of such TFs, if they exist, cannot be recognized by the method presented here when employing nucleosome occupancy data, but they can be rescued, for example, by incorporating other information sources.
X. Dai and H. Lähdesmäki designed the study and prepared the paper. O. Yli-Harja participated in the study design. X. Dai developed the new data fusion method, implemented the two novel data sources in TF target gene prediction, and performed all the simulations.
Hayashi Y, Sano N, Horikoshi M: A genomic code for nucleosome positioning. Chemtracts 2007, 19(6):223-233.
Yuan GC, Liu JS: Genomic sequence is highly predictive of local nucleosome depletion. PLoS Computational Biology 2008, 4(1):e13.
Blanco E, Farré D, Albà MM, Messeguer X, Guigó R: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Research 2006, 34: D63-D67. 10.1093/nar/gkj116
Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Prüß M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research 2000, 28(1):316-319. 10.1093/nar/28.1.316
Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJM: ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics 2006, 22(5):637-640. 10.1093/bioinformatics/btk027
MacIsaac KD, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Computational Biology 2006, 2(4):e36. 10.1371/journal.pcbi.0020036
Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW III, Bulyk ML: Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology 2006, 24(11):1429-1435. 10.1038/nbt1246
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne J-B, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 430(7004):99-104.
Qi Y, Rolfe A, MacIsaac KD, Gerber GK, Pokholok D, Zeitlinger J, Danford T, Dowell RD, Fraenkel E, Jaakkola TS, Young RA, Gifford DK: High-resolution computational models of genome binding events. Nature Biotechnology 2006, 24(8):963-970. 10.1038/nbt1233
Narlikar L, Gordân R, Hartemink AJ: A nucleosome-guided map of transcription factor binding sites in yeast. PLoS Computational Biology 2007, 3(11):e215. 10.1371/journal.pcbi.0030215
Gordân R, Hartemink AJ: Using DNA duplex stability information for transcription factor binding site discovery. In Proceedings of Pacific Symposium on Biocomputing (PSB '08), 2008. World Scientific; 453-464.
Lähdesmäki H, Rust AG, Shmulevich I: Probabilistic inference of transcription factor binding from multiple data sources. PLoS One 2008, 3(3):e1820.
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research 2005, 15(8):1034-1050. 10.1101/gr.3715005
Taylor J, Tyekucheva S, King DC, Hardison RC, Miller W, Chiaromonte F: ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements. Genome Research 2006, 16(12):1596-1604. 10.1101/gr.4537706
Benham CJ, Bi C: The analysis of stress-induced duplex destabilization in long genomic DNA sequences. Journal of Computational Biology 2004, 11(4):519-543. 10.1089/cmb.2004.11.519
Bi C, Benham CJ: WebSIDD: server for predicting stress-induced duplex destabilized (SIDD) sites in superhelical DNA. Bioinformatics 2004, 20(9):1477-1479. 10.1093/bioinformatics/bth304
Lee C-K, Shibata Y, Rao B, Strahl BD, Lieb JD: Evidence for nucleosome depletion at active regulatory regions genome-wide. Nature Genetics 2004, 36(8):900-905. 10.1038/ng1400
Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics 2004, 5(4):276-287. 10.1038/nrg1315
Ollis DL, White SW: Structural basis of protein-nucleic acid interactions. Chemical Reviews 1987, 87(5):981-995. 10.1021/cr00081a006
Wisedchaisri G, Holmes RK, Hol WGJ: Crystal structure of an IdeR-DNA complex reveals a conformational change in activated IdeR for base-specific interactions. Journal of Molecular Biology 2004, 342(4):1155-1169. 10.1016/j.jmb.2004.07.083
Duncan R, Bazar L, Michelotti G, Tomonaga T, Krutzsch H, Avigan M, Levens D: A sequence-specific, single-strand binding protein activates the far upstream element of c-myc and defines a new DNA-binding motif. Genes and Development 1994, 8(4):465-480. 10.1101/gad.8.4.465
Finocchiaro LME, Amati P, Glikin GC: Single strand binding protein specific for the polyoma early-coding strand of PEA1 (AP1) regulatory sequence. Nucleic Acids Research 1991, 19(15):4279-4287. 10.1093/nar/19.15.4279
Heimberger AB, McGary EC, Suki D, Ruiz M, Wang H, Fuller GN, Bar-Eli M: Loss of the AP-2 α transcription factor is associated with the grade of human gliomas. Clinical Cancer Research 2005, 11(1):267-272.
Christy B, Nathans D: DNA binding site of the growth factor-inducible protein Zif268. Proceedings of the National Academy of Sciences of the United States of America 1989, 86(22):8737-8741. 10.1073/pnas.86.22.8737
Walsh K, Gualberto A: MyoD binds to the guanine tetrad nucleic acid structure. Journal of Biological Chemistry 1992, 267(19):13714-13718.
Sabourin LA, Rudnicki MA: The molecular regulation of myogenesis. Clinical Genetics 2000, 57(1):16-25.
Perkins KJ, Burton EA, Davies KE: The role of basal and myogenic factors in the transcriptional activation of utrophin promoter A: implications for therapeutic up-regulation in Duchenne muscular dystrophy. Nucleic Acids Research 2001, 29(23):4843-4850. 10.1093/nar/29.23.4843
Chen H, Li B, Workman JL: A histone-binding protein, nucleoplasmin, stimulates transcription factor binding to nucleosomes and factor-induced nucleosome disassembly. EMBO Journal 1994, 13(2):380-390.
Li Q, Wrange O: Translational positioning of a nucleosomal glucocorticoid response element modulates glucocorticoid receptor affinity. Genes and Development 1993, 7(12A):2471-2482. 10.1101/gad.7.12a.2471
Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C: A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics 2007, 39(10):1235-1244. 10.1038/ng2117
Schones DE, Cui K, Cuddapah S, Roh T-Y, Barski A, Wang Z, Wei G, Zhao K: Dynamic regulation of nucleosome positioning in the human genome. Cell 2008, 132(5):887-898. 10.1016/j.cell.2008.02.022
The authors would like to thank Yuan Guo-Cheng for providing us with his software for nucleosome occupation prediction. This work was supported by Tampere Graduate School in Information Science and Engineering (TISE) (XFD) and the Academy of Finland (Grant no. 213462).
Cite this article
Dai, X., Yli-Harja, O. & Lähdesmäki, H. Novel Data Fusion Method and Exploration of Multiple Information Sources for Transcription Factor Target Gene Prediction. EURASIP J. Adv. Signal Process. 2010, 235795 (2010). https://doi.org/10.1155/2010/235795
- Data Fusion
- Binding Preference
- Nucleosome Occupancy
- Additional Data Source
- Multiple Information Source