- Research Article
- Open Access
Novel Data Fusion Method and Exploration of Multiple Information Sources for Transcription Factor Target Gene Prediction
© Xiaofeng Dai et al. 2010
- Received: 17 April 2010
- Accepted: 10 August 2010
- Published: 17 August 2010
Background. Revealing protein-DNA interactions is a key problem in understanding transcriptional regulation at the mechanistic level. Computational methods have an important role in predicting transcription factor target genes genomewide. Multiple data fusion provides a natural way to improve transcription factor target gene predictions because sequence specificities alone are not sufficient to accurately predict transcription factor binding sites. Methods. Here we develop a new data fusion method to combine multiple genome-level data sources and study the extent to which DNA duplex stability and nucleosome positioning information, either alone or in combination with other data sources, can improve the prediction of transcription factor target genes. Results. Results on a carefully constructed test set of verified binding sites in the mouse genome demonstrate that our new multiple data fusion method can reduce false positive rates, and that DNA duplex stability and nucleosome occupation data can improve the accuracy of transcription factor target gene predictions, especially when combined with other genome-level data sources. Cross-validation and other randomization tests confirm the predictive performance of our method. Our results also show that nonredundant data sources provide the most efficient data fusion.
- Data Fusion
- Binding Preference
- Nucleosome Occupancy
- Additional Data Source
- Multiple Information Source
A central problem in molecular systems biology is to understand the manner in which a cell operates its complex transcriptional machinery. At the molecular level, transcriptional processes are largely controlled by transcription factors (TFs) that bind to gene promoters in a sequence-specific manner and thereby inhibit or promote the expression of their target genes. Collectively, these DNA-binding proteins and other molecules work together to implement the complex regulatory machinery that controls gene expression. Since large-scale understanding of transcriptional regulation is still severely limited even in lower organisms, it is of great importance to reveal these regulatory protein-DNA interactions genomewide.
Experimentally verified TF-binding sites (TFBSs) have been collected in databases [3–5], and recently developed experimental methods, such as ChIP-chip or ChIP-seq, are capable of measuring in vivo TFBSs in a high-throughput manner. However, it is not possible to obtain sufficient coverage, that is, to screen all TFs under all conditions, using experimental methods alone. Therefore, the binding site prediction problem calls for computational methods. Computational predictions rely on sequence specificities that are typically taken from a database or obtained as output from a motif discovery method. Recent progress on the experimental side has also made it possible to measure TF-binding specificities in a high-throughput manner. The advent of these experimental techniques equips TF target gene prediction methods with much more accurate binding specificity models and, indeed, opens a whole new avenue for computational analysis of TF-DNA binding.
Sequence specificities alone, however, are not sufficiently informative to accurately predict TFBSs, simply because the probability of observing an exact copy of a presumably functional binding motif in a genome by chance is remarkably high. A natural way to improve TF target gene predictions is to incorporate additional information into the statistical inference of TFBSs. A number of additional data sources can be useful for this purpose, including, among others, information on coregulated genes, evolutionary conservation, physical binding locations as measured by ChIP-chip or ChIP-seq, nucleosome occupancies, CpG islands, regulatory potential, and DNase hypersensitive sites. Incorporating additional information sources to guide statistical inference has been used successfully in the context of motif discovery [8–11], but has not attracted enough attention in TF target gene prediction. We have recently developed a probabilistic TF target gene prediction method, ProbTF, which can incorporate practically any additional genome-level information source to predict TF target genes.
Statistical data fusion for TF target gene prediction becomes more challenging in the case of multiple information sources. Here we develop a new method for multiple data fusion and incorporate novel data sources into TF target gene prediction. Four additional genome-level information sources (i.e., information at the level of individual nucleotides) are employed here to improve TF target gene prediction: evolutionary conservation, nucleosome positioning data from a recently published computational method, regulatory potential, and DNA duplex stability. Each is expected to be informative of binding sites, as will be discussed shortly. Some of these and other individual data sources have already been shown to improve de novo motif discovery [8–11]. Here we demonstrate how multiple data sources can be combined to make joint statistical inference of TF target genes. Integration of data sources that have a probabilistic interpretation is relatively straightforward, and for other data sources we convert the raw data into probabilities, or prior distributions, by extending a previously proposed Bayesian transformation method. In addition, for efficient use of DNA duplex stability data, we propose a simple heuristic that can assess the binding preference (single- versus double-stranded DNA) of a TF from a set of known binding sites. Results on a carefully constructed set of verified binding sites in the mouse genome [3, 5, 12] demonstrate that the new data fusion method that we propose here improves the performance of TF target gene prediction methods. We also demonstrate that a number of genome-level data sources, either alone or especially in combination, are highly informative of TF target genes. Consequently, our statistical data fusion method can provide valuable new insights into genomewide models of transcriptional regulatory networks.
Figure 1 shows that the highest log-likelihood score is not always obtained at the annotated binding site. TFs are commonly associated with multiple PSFMs since one TF may allow certain variation in its binding motif. Thus, it can be difficult to combine predictions from multiple PSFMs, given that these PSFMs may be extremely similar or different. This issue can be solved by, for example, the ProbTF method, which implements an intuitive way of combining predictions from multiple PSFMs: ProbTF considers all possible numbers of nonoverlapping TFBSs in all possible locations and configurations and weights each configuration according to its probability. A more difficult problem is to decide which of the peaks predicted by PSFMs correspond to real, functional binding sites. As illustrated in Figure 1, the PSFM-based profiles have relatively good sensitivity but poor specificity, which is common to many PSFMs. The lack of specificity can be greatly improved by genome-level data fusion, which forms the focus of this study.
Consistent with what is known about transcriptional regulation, many of the verified binding sites have a high degree of conservation and high regulatory potential scores and are typically free of stable nucleosomes (i.e., have low nucleosome occupancy scores). Moreover, DNA double helix destabilization energies at TF binding sites differ from those at random sites. In particular, TFBSs tend to have a high DNA duplex stability score if a TF prefers to bind both strands of the promoter sequence (Figures 1(a) and 1(b)) and a low DNA stability score in the opposite case (Figures 1(c) and 1(d)). The above reasoning seems to provide a simple logic for filtering the real TFBSs.
However, the correlation between TFBSs and any of the additional data sources cannot be expected to be perfect, even from a biological point of view. For example, only about 50% of functional binding sites are assessed to be evolutionarily conserved. The additional information sources are also noisy, regardless of whether they are experimental measurements or computational predictions. The only possibility is to make statistical inference, which takes the inherent randomness into account, from multiple genome-level data sources. The rationale is that the accuracy of computational TF target gene predictions naturally improves when more (useful) information is incorporated into the statistical analysis.
2.1. Probabilistic Framework for TF Target Gene Prediction
We first describe the TF target gene prediction algorithm employed in this study (full details can be found from ). Let denote a single strand of a promoter sequence, where and is the length of the sequence (generalization to double-stranded DNA sequences is also possible but omitted here). Let denote the number of (unknown) binding sites and the (hidden) start positions of nonoverlapping binding sites in sequence ; that is, if then .
where . We use . This formula defines the (user definable) prior expectation of the number of binding sites in a given DNA sequence. Importantly, it does not incorporate any of the informative data sources studied here. This prior mainly raises or lowers the overall level of the estimated binding probabilities and, as such, has little effect on, for example, the ROC curves. The probability is obtained with the assumption that, for a fixed value of , the prior over binding site positions and configurations is uniform and inversely proportional to the number of different binding site positions and configurations. The probability is obtained by summing over all possible positions and configurations, and can be computed efficiently using a recursive formula .
where and . Consequently, the same efficient recursive algorithm can be used to compute (see  for more details).
Note that the choice of Markovian and PSFM models is arbitrary. Also note that, since additional data are incorporated using probabilities of binding over the promoter sequence, we could also employ methods other than ProbTF.
2.2. Data Integration Method
The data integration method is parameterized by and . Note that and . It is also worth noting that the resulting probabilities do not include hard thresholding for any of the genomic locations although thresholding is involved in integration, and the use of thresholding during the construction is motivated by its simple yet powerful parametrization.
where is the probability that a DNA site is a TFBS ( ) given the value of the raw data . For conservation and regulatory potential the original data are already in a probabilistic format, and for nucleosome and DNA stability data the conversion of the raw data into probabilities was described in the previous sections. Probability is the final integrated prior probability for position after scaling, which is directly used in further TFBS prediction as explained, for example, in (6) and (7).
AUC scores and scaling parameters for all data sources and their combinations. Data source combinations from 0 to 4 information sources are colored grey, green, blue, yellow, and magenta, respectively. "a" and "b" are the multiplicative factor and bias term, respectively, for scaling each additional data source, and "c" is the scaling parameter used for combining multiple information sources into the TF target gene prediction framework. All the parameters shown here are selected with respect to the largest AUC scores.
The scaling parameters, that is, " ", " " and " ", are relatively robust, whose slight variations would not dramatically affect the results. We varied " " of the DNA duplex stability data (for both double and single strand binding data), which is supposed to have more effect on the results (recall that " " weights different information sources and reflects their importance), and listed its AUC scores for single data source as well as its combination with other additional information sources in supplementary Table . It is clear that with small changes of " ", the results do not vary significantly. However, for the weighting parameters, that is, " " to " ", and the threshold, , their small changes may have greater effect on the results, since they determine how different data sources are combined. This can be seen from the closer values among " " to " ", which are 0.72, 0.72, and 0.73, respectively. These parameters depend heavily on the quality and type of data, and should be optimized before data integration.
2.3. DNA Duplex Stability Data
DNA stability measures the amount of energy needed to separate the two strands of DNA. In this study, the DNA destabilization energies were obtained from the online tool WebSIDD [15, 16], with the parameters set to "DNA Type: circular", "Energetic Type: near neighbor", and "Energy Cutoff: level 4". Note that the circular DNA setting is used to calculate the duplex stabilities of linear DNA. This is because WebSIDD handles linear DNA in the same way as circular DNA except that it adds 50 G/C to the end, which is not needed here given the extended DNA used. We obtained the energy score for each sequence with a 1 kb extension at both ends. For every binding site, we computed the energy of destabilization as the average of the destabilization values for all positions within the site.
2.3.1. Assessing Binding Preference for Each TF
Relatively little is reported in the literature about specific types of protein-DNA interactions, and protein domain annotations are not available for all TFs. We therefore decided to assess the binding preference of each factor simply by looking at the DNA stability scores at the known binding sites in the test data set. Assuming that the binding preference of a TF is the same for all its binding sites, we estimated the binding preference of each TF with the following heuristic. Let denote the set of all known start binding positions of a TF among all the tested sequences in our test set. For all the known binding sites in , we compute counts and which are the number of times and , respectively, where is the width of the verified binding site in the test set and is the threshold specified by quantile . Then, the TF is assigned to bind in a double-strand manner if , in a single-strand manner if , and in cases , random preference is assigned. To make the above heuristic more robust, we repeated it for three thresholds specified by different quantiles, with both raw and smoothed DNA duplex stability scores. The final binding preference of each TF is decided by taking a vote among these six binding preferences, and again, in case of a tie, a random binding preference is assigned.
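The voting heuristic can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the quantile levels (0.25, 0.5, 0.75), the moving-average smoothing with a window of 5, the use of a single score track for all sites, and the comparison of per-site mean scores against one threshold per quantile are all hypothetical choices made here, because the exact inequalities and parameter values are not recoverable from the text.

```python
import random
from statistics import mean

def quantile(values, q):
    """Empirical q-quantile by linear interpolation (0 <= q <= 1)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def smooth(scores, window=5):
    """Centered moving average; the window size is an illustrative assumption."""
    half = window // 2
    return [mean(scores[max(0, i - half): i + half + 1])
            for i in range(len(scores))]

def vote(site_means, threshold, rng):
    """One vote: compare how many sites lie above versus below the threshold."""
    n_high = sum(m > threshold for m in site_means)
    n_low = sum(m < threshold for m in site_means)
    if n_high > n_low:
        return "double"
    if n_low > n_high:
        return "single"
    return rng.choice(["double", "single"])  # tie: random preference

def binding_preference(scores, sites, quantile_levels=(0.25, 0.5, 0.75), rng=random):
    """scores: per-position duplex stability; sites: list of (start, width)."""
    votes = []
    for track in (scores, smooth(scores)):        # raw and smoothed scores
        site_means = [mean(track[s:s + w]) for s, w in sites]
        for q in quantile_levels:                 # three thresholds per track
            votes.append(vote(site_means, quantile(track, q), rng))
    n_double = votes.count("double")
    n_single = votes.count("single")
    if n_double == n_single:                      # tie among the six votes
        return rng.choice(["double", "single"])
    return "double" if n_double > n_single else "single"
```

With three quantiles and two tracks, the function collects six votes, matching the description above; a TF whose sites sit in high-stability regions is labeled "double" and one whose sites sit in low-stability regions is labeled "single".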
2.3.2. Construction of DNA Duplex Stability Prior
We built three data sets to construct the DNA duplex stability priors: one positive single-strand binding data set, one positive double-strand binding data set, and one background data set. The positive data sets are constructed from the 226 known binding sites in our test data set by splitting them into single- and double-strand binding sets according to the binding preference of each TF. The background data set is generated as follows. For each verified binding site in our test set, we randomly select 20 genomic locations from the same promoter sequence, each of length 12 (the average binding site length), which results in a background set that is 20 times larger than the test set.
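The background sampling step can be sketched as below. The function name, the input layout (a dictionary of promoter lengths and site counts), and the fixed seed are hypothetical; the sketch also assumes that background windows are drawn uniformly and are not filtered against the known binding sites, which the text does not specify.

```python
import random

def sample_background(promoters, per_site=20, width=12, seed=0):
    """Draw per_site random windows for every verified site of each promoter.

    promoters: dict mapping promoter name -> (sequence_length, n_verified_sites).
    Returns a list of (promoter_name, start, width) tuples.
    """
    rng = random.Random(seed)
    background = []
    for name, (length, n_sites) in promoters.items():
        for _ in range(per_site * n_sites):
            # Uniform start such that the whole window fits in the promoter.
            start = rng.randrange(length - width + 1)
            background.append((name, start, width))
    return background
```

For example, `sample_background({"p1": (1000, 3), "p2": (500, 2)})` returns 100 windows, 20 for each of the 5 verified sites.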
2.4. Nucleosome Occupation Data
2.4.1. Construction of Nucleosome Occupation Prior
We built the nucleosome occupation prior in the same way as for the DNA stability data, but with only two data sets: positive and background (see also ). The positive data set consists of the averaged -scores (the raw nucleosome occupancy scores obtained using the method in ) of the known binding positions. The background data set is composed of the averaged -scores of randomly selected genomic locations, chosen in the same way as above. For every occupation score , the conditional probabilities for binding and nonbinding sites are denoted as and , respectively. The CDFs of the two nucleosome data sets are shown in Figure 3(b), which indicates that the nucleosome positioning information from  is informative of TFBSs. The probability that a DNA site is a TFBS given its nucleosome occupation score is obtained by (13) (with replaced by ). Note that .
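A generic Bayes-rule conversion of a raw score into a binding probability is sketched below. It is an assumption-laden illustration rather than a reproduction of (13): the histogram density estimate, the pseudocount, the bin edges, and the prior probability `prior_site` are all hypothetical choices introduced here.

```python
from bisect import bisect_right

def bin_index(x, edges):
    """Index of the histogram bin containing x; out-of-range values are clipped."""
    i = bisect_right(edges, x) - 1
    return min(max(i, 0), len(edges) - 2)

def histogram(values, edges, pseudocount=1.0):
    """Normalized bin frequencies; the pseudocount avoids zero probabilities."""
    counts = [pseudocount] * (len(edges) - 1)
    for v in values:
        counts[bin_index(v, edges)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def make_posterior(positive, background, edges, prior_site=0.5):
    """Return a function mapping a raw score to P(site | score) by Bayes' rule."""
    p_pos = histogram(positive, edges)
    p_bg = histogram(background, edges)

    def posterior(x):
        b = bin_index(x, edges)
        num = p_pos[b] * prior_site
        return num / (num + p_bg[b] * (1.0 - prior_site))

    return posterior
```

Because the bins have equal width in typical use, the ratio of bin frequencies equals the ratio of the estimated densities, so the posterior depends only on how much more often a score range occurs at binding sites than in the background.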
Sequences used in this study. One "TFBS duplex stability score" is computed as the average of all the raw DNA duplex stability scores over a given TFBS. The TFBS duplex stability scores are computed for all the binding sites of a promoter sequence. Note that one sequence can have multiple binding TFs and TFBSs, one TF can bind to more than one site, and one TFBS may be recognized by multiple TFs.
Binding TFs | TFBS duplex stability scores
Sp1, Ap2, Egr1 | 10.03, 9.98, 9.70, 10.03, 10.10, 9.66
Srf, Tef, Sp1, Tead1, Sre, Tbp | 7.66, 7.71, 7.31, 7.69, 7.71, 7.94, 7.66, 7.77, 7.46, 6.67, 8.09, 7.73, 5.18
Carg-d, Prm, Carg-c | 9.70, 10.01, 9.76
Sp1, Myod1, Srf, Tbp | 9.93, 9.82, 9.73, 9.72, 9.23, 9.94, 9.82, 8.90
Tcf1, Cebp, Hnf1, Cebp | 9.17, 9.80, 9.44, 9.71, 9.64, 9.20
Myf, E1, E2, E3 | 9.85, 9.91, 9.93, 9.92, 9.92
Myf, Tef, E1 | 9.59, 9.78, 9.77, 9.91
(not listed) | 9.90, 9.88, 9.81, 9.76, 9.55, 9.61
Myf, E1, E2, E3, E4 | 9.74, 9.84, 9.64, 9.28, 9.83, 9.86
Srf, Nvl, Mef, Prrx1, Myog, Myod1, Myf5, Mef1, Ap2, Myf, Carg3, Mef2-left (-right), E-left (-right), Trp53 | 9.78, 9.89, 8.15, 8.63, 8.06, 9.94, 9.94, 9.94, 9.95, 9.95, 9.35, 9.94, 9.95, 9.80, 9.97, 9.74, 9.94, 9.69, 8.34, 8.53, 9.80, 9.95, 9.96, 9.87, 9.70
E1, Mef2c, Myod1, Tbp | 9.88, 8.23, 6.49, 9.88, 9.88, 8.66
(not listed) | 4.69, 4.48, 3.92
Srf, Ap1, Creb | 3.99, 5.71, 7.09
Usf, Egr1, Ap2a, Tbp | 8.55, 9.60, 9.57, 4.16
Cebp, Nfya, Tbp | 2.54, 2.50, 4.34
Myod1, Mef2, E2, Tbp | 8.90, 9.66, 9.73, 8.78, 9.62, 8.34
Myf, Myog, Myf5, Myod1 | 9.62, 9.63, 9.77, 9.63, 9.77, 9.77, 9.63
Mef, Tef, Srf, Mef2, Tead1 | 7.60, 9.75, 7.60, 8.80, 7.60, 9.75, 8.94
E4, E1, Carg | 9.97, 9.94, 9.14
Ap2, Gc2, Ccaat-box, Sp1, Tbp | 9.91, 9.98, 9.55, 9.99, 8.35, 9.55, 9.98
Myf, Mef, Mef2, E1, Def-2, Myog, Tbp, Myod1 | 9.03, 7.21, 9.87, 7.00, 9.79, 7.01, 9.79, 9.79, 9.02, 7.00, 7.03, 8.31, 9.79
Cef-2, Sp1, Mef2, Mef3, Gata4 | 9.04, 9.25, 6.54, 9.52, 9.19, 6.54, 9.49, 8.54
Cebp, Tcf1, Hnf2, Hnf3, Cebp, Hnf4, Hnf1 | 9.74, 9.47, 9.50, 9.38, 9.13, 9.74, 9.31, 9.38, 9.50, 9.45, 9.12
Sp1, Cebp, Tbp | 10.05, 9.85, 8.78
Nfkb1, Ap1, Nfat | 6.52, 6.63, −0.40, −0.33
Hnf1, Ipf1, Creb, Tbp | 5.43, 5.57, 7.27, 8.21
(not listed) | 6.38, 0.35, 0.32
Oct, Aml, Egr1 | 2.74, 3.65, 9.77
(not listed) | 9.89, 9.84, 9.81, 9.89, 9.70
Sp1, Ap2a, Tbp | 9.46, 6.08, 4.55, 3.14, 2.40, 4.10
(not listed) | 10.03, 9.44, 10.02
Myb, Caat, Tbp | 9.90, 9.74, 6.76, 6.25
(not listed) | 10.03, 9.94, 9.92
(not listed) | 5.42, 8.60, 7.87, 9.87
(not listed) | 8.41, 9.10, 8.80, 1.90
In this section, we report and discuss, in turn, the exploration of two novel additional data sources, the evaluation of the new data fusion method, and the comparison of different data source combinations in TF target gene prediction. The idea of our computational methods is to probabilistically bias the search for binding sites toward those genomic locations that are more likely to contain binding site(s) in light of the additional data. The quality of the TF target gene prediction results is evaluated by ROC curves and histograms of the estimated binding probabilities, which are drawn from the probabilities over all the TFs and sequences being analyzed. The test data set used throughout this paper consists of 47 promoter sequences, each containing a varying number of annotated binding sites from the ABS and ORegAnno databases.
3.1. Novel Informative Data Sources
3.1.1. DNA Duplex Stability Prior
Most sequence-specific DNA binding proteins contact the major groove of double-stranded DNA in the B conformation, and some TFs are shown to bind DNA in a double-strand manner according to their crystal structures. Thus, the DNA destabilization energies at the protein binding sites of these TFs are expected to be high. This assumption has been verified in yeast by  in work on improving the accuracy of TFBS discovery, which is a different topic from TF target gene prediction. On the other hand, during transcription, the two DNA strands must be separated to let RNA polymerase slide along the DNA molecule and synthesize a nascent mRNA. Since the binding sites for many general TFs are located in the proximal promoter regions of the transcribed gene, the stability of the DNA double helix in these regions is expected to be low. In addition, there is experimental evidence showing that some regulatory proteins bind to DNA in a single-strand manner [21, 22]. Taken together, these observations suggest that DNA duplex stability data should be informative of binding sites; whether a lower or higher DNA duplex stability at specific TF binding sites is preferable depends largely on the binding preference of the TF, that is, whether the TF binds to the DNA in a double- or single-strand manner. In our study, we assume that TFBSs for TFs with single-strand binding preference occur preferentially in regions with low DNA duplex stability, and that TFBSs for double-strand binding TFs occur preferentially in regions with high DNA duplex stability.
In the TF target gene prediction analysis, the raw DNA duplex destabilization energies were converted into probability values using a Bayesian transformation method, and each TF's binding preference is predicted with a heuristic method (see Section 2 for details).
Transcription factors used in this study. "1" and "2" indicate that the corresponding TF binds to DNA in a single-strand and double-strand manner, respectively. An empty cell means that no information was found in the literature.
GCCGGAGG, CCGCCGGGGTGG, CCCAGGG
CCCAACACCTGCTGCCTGAGCC, CATCTG, CAGTTG
CAACTG, (ATTAACCCA)GACATGTGGC(TGCCCC), CATCTG,
(CCCCCCAA)CACCTGCTG(CCTGAGCC), CACTTG, CAGTTG
(C)CCCAACACCTGCTGCCTGAGCC, CATCTG, CAGTTG
CCGCCC, CCCCACCCCCTGCA, GCGCCAGGGCTGGGCTCCT,
TCCTGAAGACCCGCCCTTTTTC, GGCAGAG, CAACC,
(AGGG)TGGGCAG(TCC), GAGGTGGGGGG, AGCCAG,
TGCTTCCCATATATGGCCATGT, CCATATTAGG, CTATTATGG
(C)TATAAA(A), TACAAAT, TTAAA,
ATAAATA, TTAAAT, TATAAG
3.1.2. Nucleosome Occupation Prior
Chromatin structure has an important role in regulating the transcriptional machinery. At the genome level, these mechanisms are controlled by the basic structural subunits, nucleosomes, which can limit the access of TFs to their binding sites [1, 17]. Thus, from the viewpoint of computational TFBS prediction, the likelihood of a TF binding to nonfunctional sites can be decreased by locating a stable nucleosome over those genomic regions while keeping functional sites accessible for TF binding. This assumption is supported, for example, by the fact that the binding of SP1, GAL4, and USF to nucleosome cores requires other proteins, such as nucleoplasmin, to remove H2A and H2B, which consequently results in nucleosome disassembly, and by evidence that the binding propensity of glucocorticoid receptor (GR) to the nucleosome core is much lower than that to the nucleosome-free sequence. However, the possibility that some TFs bind equally well or even better to sequences occupied by nucleosomes than to nucleosome-free regions cannot be excluded; in such cases nucleosome location data alone will not be sufficient, and multiple data sources may be used to improve the prediction accuracy.
High-resolution genomewide nucleosome positioning data exist for organisms such as yeast and human, but in the case of mouse, we currently need to rely on computational predictions. Indeed, this computational prediction problem has attracted a lot of interest, and improved methods have been proposed recently. The ProbTF method was previously tested with predicted nucleosome locations from Segal's original model, which relies on dinucleotide frequencies, and the nucleosome data were not found to be informative of binding sites. Here we explore whether more recent and more accurate nucleosome positioning data, together with a novel data fusion method, can improve TF target gene prediction. In this study, we used a computational multiresolution method developed in  to predict the nucleosome locations for all the 47 tested sequences. We decided to use the raw nucleosome positioning data, that is, without the hidden Markov model (HMM) processing, and to employ the extended sequences to obtain the -score for each genomic location. The raw data were further converted into probabilities using a Bayesian transformation method (for details see Section 2).
We compared the two different nucleosome data sets by integrating them separately into our TF target gene prediction algorithm. It is particularly promising to see that the use of more accurate nucleosome positioning data from  results in more accurate TF target gene prediction, as shown in supplementary Figure (a). As in the case of DNA duplex stability data, we combined nucleosome data with conservation (supplementary Figure (e)) or regulatory potential data (supplementary Figure (c)), and the combined data again improve the TF target gene predictions. For example, the AUC score of 0.7555 obtained with nucleosome data alone increases to 0.7946 when combined with regulatory potential, and to 0.8334 when combined with conservation.
To gain insight into each individual data source and to assess the extent of a possible overfitting problem stemming from parameter optimization, we also prepared an additional control simulation. We shifted each additional data source by 100 base pair positions and then applied our computational methods as explained above, including binding preference prediction and optimization of parameters, to test the performance of randomized data. ROC curves corresponding to the four shifted information sources are shown in supplementary Figure and the AUC scores after shifting for each data source are recorded in Table 1. For the two novel data sources, we also compared their CDFs after shifting with the original ones, as shown in Figure 3. The Kolmogorov-Smirnov statistic (KS statistic) for the CDFs of DNA duplex stability scores at random sites and double-strand binding sites is 0.3097, and that for random sites and single-strand binding sites is 0.3641 (as depicted in Figure 3(a)). After shifting, however, the KS statistics between random locations and double-strand binding sites and between random sites and single-strand binding sites become 0.1905 and 0.1168, respectively (see Figure 3(c)). Similarly, the KS statistic between the CDFs of nucleosome positioning data at random sites and nucleosome binding sites is 0.1699 (Figure 3(b)) and drops to 0.0379 after shifting (Figure 3(d)). We also measured the Kullback-Leibler divergence (KL divergence) between each density pair. The KL divergences between the PDFs of DNA duplex stability scores at random sites and at double- and single-strand binding sites are 0.1868 and 0.6617 (Figure 3(a)), which decrease to 0.1037 and 0.1065, respectively, after shifting (Figure 3(c)). Likewise, the KL divergence between the PDFs of nucleosome positioning data at random sites and nucleosome occupied sites drops from 0.1830 to 0.0330 after shifting, as represented by Figures 3(b) and 3(d), respectively.
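The two distance measures and the shift control can be sketched as follows. This is a minimal pure-Python illustration under assumptions: the histogram binning and pseudocount used for the KL divergence, and the circular handling of the track ends when shifting, are hypothetical choices, since the text does not state how densities were estimated or how sequence ends were treated.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d

def histogram(values, edges, pseudocount=1.0):
    """Normalized bin frequencies over shared edges (pseudocount avoids zeros)."""
    counts = [pseudocount] * (len(edges) - 1)
    for v in values:
        i = min(max(bisect_right(edges, v) - 1, 0), len(edges) - 2)
        counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions defined on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shift_track(track, offset=100):
    """Shift a per-position data track by offset positions (circular: assumption)."""
    k = offset % len(track)
    return track[-k:] + track[:-k] if k else list(track)
```

Recomputing `ks_statistic` and `kl_divergence` between site scores and background scores on a shifted track gives the kind of before-and-after comparison reported above.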
These results show that no information is gained from the shifted data sources. Taken together with the cross-validation results shown above, this demonstrates that the improved binding prediction accuracy is not an artifact of overfitting.
We further compared the scaling parameter (see (11) in Section 2) when integrating different nucleosome data and DNA duplex stability data into the TF target gene prediction framework. The parameter essentially determines the weight of each individual information source. As shown in Table 1, parameter of nucleosome positioning data obtained from  is higher than that obtained with data from  , which is consistent with results in supplementary Figure (a) where nucleosome data from  clearly provide more information than those obtained from . Similarly, parameter of DNA duplex stability data for TFs with single-strand binding pattern is higher than that for TFs with double-strand binding pattern . This is again consistent with results shown in Figure 3(a), where DNA stability energies of single-strand binding TFs provide much better discrimination than those of double-strand binding TFs. These results show that the scaling parameter is associated with data quality: a higher value indicates a more informative data source.
3.2. Multiple Data Fusion Method
We next briefly demonstrate the performance of the new data fusion method and compare it with that of a standard weighting-based scheme proposed in . Qualitatively, the previous data fusion method is based on a type of averaging where a genomic location is suggested to contain a binding site only if a large majority of the additional data sources indicate a binding site, whereas the new method can assign more prior probability to a genomic location if it is indicated as a binding site by a few (or even a single) more informative data sources (see Section 2 for a detailed technical description of our data fusion methods).
The performance of the old and new data fusion methods is illustrated in supplementary Figure , which shows the ROC curves for finding the verified binding sites in the gene promoter set using both evolutionary conservation and regulatory potential. Parameters in supplementary Figures (a) and (c) are chosen by the whole AUC and the AUC30, respectively. Supplementary Figure (a) shows that the new method works better than the old one by generating a higher overall AUC, and supplementary Figure (c) demonstrates that the new method can improve the prediction accuracy especially in the low false positive rate (FPR) region, which is a highly preferable property in general.
Supplementary Figures (b) and (d) show the histograms of the predicted binding probabilities for both the old and new data fusion methods, where the parameters in Figures (b) and (d) are selected according to the whole AUC and AUC30, respectively. Histograms are drawn separately for negative and positive cases and, hence, these graphs clearly demonstrate how well the two methods are able to discriminate the target genes that contain known binding sites from nontarget genes that do not contain binding sites. From these graphs, we can see that the new method improves discrimination by assigning much smaller binding probabilities to sequences with no known binding sites (regardless of whether AUC or AUC30 is used), which results in a much smaller false positive rate. AUC scores for single data sources and all combinations of multiple data sources are summarized in ascending order in Table 1, and their corresponding data fusion results are shown and discussed in the following sections.
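A whole-curve AUC and a partial AUC restricted to the low false positive region can be computed as in the sketch below. The interpretation of AUC30 as the unnormalized area under the ROC curve for FPR up to 0.30 is an assumption made here for illustration; the text does not define it, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) points; labels are 1 (site) or 0 (no site)."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Group tied scores so that they produce a single ROC point.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Calling `auc(points)` gives the whole AUC, while `auc(points, max_fpr=0.30)` gives the partial area; selecting parameters by the latter favors methods that rank true sites above negatives at the top of the list.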
3.3. Comparison of Combinations of Information Sources
To better understand which combinations of additional genome-level data sources are most informative of TFBSs, we compared the TF target gene prediction accuracy of all possible combinations of evolutionary conservation, regulatory potential, nucleosome locations, and DNA duplex stability. The best combination is conservation and nucleosome positioning, whose results have already been shown in supplementary Figure (e).
Results for all the six duplets of data sources are reported in supplementary Figure , which shows that most of the combinations of two data sources work better than their corresponding single data sources except for the combination of nucleosome occupation and DNA energy. This suggests that certain redundancy might exist between nucleosome occupation and DNA energy, which is not entirely surprising since a DNA region that is not within a nucleosome is likely to need less energy to destabilize the two strands than DNA within a nucleosome. This motivates us to group the four information sources into two categories, where group 1 includes evolutionary conservation and regulatory potential, and group 2 includes nucleosome locations and DNA duplex stability. Our results indicate that when a pair of data sources come from different groups, that is, have little redundancy, their joint performance can be better than those of their corresponding single data sources. Moreover, the best performance is achieved with a pair of additional data sources (supplementary Figure (b)), and adding more information sources into this pair cannot further improve the accuracy. The above results and analysis suggest that combining data sources that are redundant does not necessarily improve the overall performance. In other words, in order to gain a better prediction accuracy it is better to combine data sources that provide information from different perspectives of the same biological system.
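The set of combinations being compared, and the grouping of sources by redundancy, can be enumerated with a short sketch. The source labels and helper names are illustrative only; the group assignment follows the two categories described above.

```python
from itertools import combinations

# Group 1: conservation, regulatory potential; group 2: nucleosome, duplex stability.
GROUP = {
    "conservation": 1,
    "regulatory potential": 1,
    "nucleosome positioning": 2,
    "DNA duplex stability": 2,
}

def all_combinations(sources):
    """Map subset size k -> list of all k-element combinations of the sources."""
    sources = list(sources)
    return {k: list(combinations(sources, k)) for k in range(len(sources) + 1)}

def spans_both_groups(combo):
    """True if the combination draws on sources from both groups."""
    return {GROUP[s] for s in combo} == {1, 2}

combos = all_combinations(GROUP)
cross_group_pairs = [c for c in combos[2] if spans_both_groups(c)]
```

Four sources give 1, 4, 6, 4, and 1 combinations of sizes 0 to 4, and four of the six pairs take one source from each group.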
Results for all four triplets of data sources, shown in supplementary Figure , are all better than those of their corresponding single data sources. The best result is obtained by combining conservation, regulatory potential, and nucleosome positioning, which accords well with our expectation, since "conservation and regulatory potential" is the most informative pair in the lower false positive region (supplementary Figure (f)) and "nucleosome positioning and regulatory potential" forms the best pair in the higher false positive region (supplementary Figure (d)).
Supplementary Figure shows the ROC curve for the only quartet. Although one could expect that adding more information sources to TF target gene prediction always improves the prediction accuracy, our results show that this is not always the case. This finding can be understood in light of the difficulty of combining complex and poorly characterized genome-level data sources for TF target gene prediction.
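The combinations compared in this section can be enumerated with a short Python sketch. It only illustrates the bookkeeping (six pairs, four triplets, and one quartet, with pairs split by the two groups described above) and is not part of the prediction method itself. The short source labels and the helper names are chosen for this example.

```python
from itertools import combinations

# Grouping as described in the text; the labels are names chosen for this sketch.
GROUPS = {
    "conservation": 1,
    "regulatory_potential": 1,
    "nucleosome": 2,
    "duplex_stability": 2,
}


def source_subsets(sources, min_size=2):
    """All subsets of the data sources with at least min_size members."""
    return [
        subset
        for size in range(min_size, len(sources) + 1)
        for subset in combinations(sources, size)
    ]


def spans_both_groups(subset):
    """True when the subset mixes sources from group 1 and group 2."""
    return len({GROUPS[source] for source in subset}) > 1


subsets = source_subsets(list(GROUPS))
pairs = [s for s in subsets if len(s) == 2]
triplets = [s for s in subsets if len(s) == 3]
quartets = [s for s in subsets if len(s) == 4]
cross_group_pairs = [p for p in pairs if spans_both_groups(p)]
within_group_pairs = [p for p in pairs if not spans_both_groups(p)]
```

Of the six pairs, four mix the two groups and two stay within one group; the pair of nucleosome occupancy and DNA duplex stability, which the text identifies as the exception, is one of the two within-group pairs.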
We have three main contributions in this paper. Firstly, we have developed a new data integration method for TF target gene prediction from multiple data sources. The new method is compared with the one employed in  using a TF target gene prediction algorithm called ProbTF , and the results show that the new data fusion principle improves on the previous method by lowering the false positive rate. Secondly, we have demonstrated the use of two novel information sources, DNA duplex stability and raw nucleosome occupancy predictions from a method proposed in , to guide TF target gene predictions. Our results show that both nucleosome occupancy and DNA stability data can improve TF target gene prediction accuracy, especially when combined with evolutionary conservation or with conservation and regulatory potential. Moreover, more accurate nucleosome predictions result in better TF target gene predictions. It is also worth noting that we do not distinguish between TFs regarding data source usage except for DNA duplex stability, where double-strand and single-strand binding proteins are treated differently and a heuristic method is adopted to classify them. Thirdly, we have compared all possible combinations of conservation, regulatory potential, nucleosome positioning, and DNA stability; these results can guide data source selection or preparation when dealing with a data integration problem in a particular application. We grouped the four tested information sources into two categories based on biological arguments: group 1 contains conservation and regulatory potential, and group 2 consists of nucleosome locations and DNA duplex stability. We found that combining data from different groups is more likely to improve TF target gene predictions, presumably because data sources from different groups are less redundant.
Although the assumption that all TFs bind to DNA in a double-stranded manner works well in yeast , it may not be sufficient in higher organisms, such as mouse, as shown in this study (see, e.g., Figure 3(a)). Instead, we obtained an informative DNA duplex stability prior by assuming different binding preferences for different TFs. We constructed the binding preference of each TF with a simple heuristic that assesses the binding preference of a TF from a set of known binding sites. We have used cross-validation and additional base pair shifting simulations to show that binding preference prediction and parameter optimization do not result in any (optimistic) bias, or overfitting, in binding prediction accuracy. However, the use of the DNA duplex stability data is limited because little verified information about TF binding specificities can be found in the literature; binding specificities therefore need to be learned from the data as well, which currently requires a set of verified binding sites. Future research goals include developing an (unsupervised) algorithm for predicting the binding preferences of TFs without prior knowledge of binding sites. Moreover, it is possible that one TF may have multiple folding modes and can bind different sequences with different patterns. For example, MyoD, a member of the helix-loop-helix protein family, can not only recognize the double-stranded DNA-binding site (called E-box) in many muscle and nonmuscle genes but also bind to the noncoding strand of an E-box from the muscle-specific creatine kinase enhancer in a single-stranded manner . To take this possibility into account, a more sophisticated assumption can be applied; that is, TFs can have different binding preferences for different sequences or under different experimental conditions. In this direction, we can also try to incorporate other data sources, such as ChIP-chip data, into our data fusion framework.
Nucleosome positioning data are employed in this study under the assumption that nucleosomes compete with DNA binding proteins  for target DNA binding sites. Although this assumption is generally true, we cannot exclude the possibility that some TFs selectively bind to nucleosome-occupied regions. Binding sites of such TFs, if they exist, cannot be recognized by the method presented here when nucleosome occupancy data are employed, but they can be rescued, for example, by incorporating other information sources.
X. Dai and H. Lähdesmäki designed the study and prepared the paper. O. Yli-Harja participated in the study design. X. Dai developed the new data fusion method, implemented the two novel data sources in TF target gene prediction, and performed all the simulations.
The authors would like to thank Yuan Guo-Cheng for providing us with his software for nucleosome occupancy prediction. This work was supported by Tampere Graduate School in Information Science and Engineering (TISE) (XFD) and the Academy of Finland (Grant no. 213462).
- Hayashi Y, Sano N, Horikoshi M: A genomic code for nucleosome positioning. Chemtracts 2007, 19(6):223-233.
- Yuan GC, Liu JS: Genomic sequence is highly predictive of local nucleosome depletion. PLoS Computational Biology 2008, 4(1):e13.
- Blanco E, Farré D, Albà MM, Messeguer X, Guigó R: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Research 2006, 34: D63-D67. 10.1093/nar/gkj116
- Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Prüß M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research 2000, 28(1):316-319. 10.1093/nar/28.1.316
- Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJM: ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics 2006, 22(5):637-640. 10.1093/bioinformatics/btk027
- MacIsaac KD, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Computational Biology 2006, 2(4):e36. 10.1371/journal.pcbi.0020036
- Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW III, Bulyk ML: Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology 2006, 24(11):1429-1435. 10.1038/nbt1246
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, Tagne J-B, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 430(7004):99-104.
- Qi Y, Rolfe A, MacIsaac KD, Gerber GK, Pokholok D, Zeitlinger J, Danford T, Dowell RD, Fraenkel E, Jaakkola TS, Young RA, Gifford DK: High-resolution computational models of genome binding events. Nature Biotechnology 2006, 24(8):963-970. 10.1038/nbt1233
- Narlikar L, Gordân R, Hartemink AJ: A nucleosome-guided map of transcription factor binding sites in yeast. PLoS Computational Biology 2007, 3(11):e215. 10.1371/journal.pcbi.0030215
- Gordân R, Hartemink AJ: Using DNA duplex stability information for transcription factor binding site discovery. In Proceedings of Pacific Symposium on Biocomputing (PSB '08), 2008. World Scientific; 453-464.
- Lähdesmäki H, Rust AG, Shmulevich I: Probabilistic inference of transcription factor binding from multiple data sources. PLoS One 2008, 3(3):e1820.
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research 2005, 15(8):1034-1050. 10.1101/gr.3715005
- Taylor J, Tyekucheva S, King DC, Hardison RC, Miller W, Chiaromonte F: ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements. Genome Research 2006, 16(12):1596-1604. 10.1101/gr.4537706
- Benham CJ, Bi C: The analysis of stress-induced duplex destabilization in long genomic DNA sequences. Journal of Computational Biology 2004, 11(4):519-543. 10.1089/cmb.2004.11.519
- Bi C, Benham CJ: WebSIDD: server for predicting stress-induced duplex destabilized (SIDD) sites in superhelical DNA. Bioinformatics 2004, 20(9):1477-1479. 10.1093/bioinformatics/bth304
- Lee C-K, Shibata Y, Rao B, Strahl BD, Lieb JD: Evidence for nucleosome depletion at active regulatory regions genome-wide. Nature Genetics 2004, 36(8):900-905. 10.1038/ng1400
- Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics 2004, 5(4):276-287. 10.1038/nrg1315
- Ollis DL, White SW: Structural basis of protein-nucleic acid interactions. Chemical Reviews 1987, 87(5):981-995. 10.1021/cr00081a006
- Wisedchaisri G, Holmes RK, Hol WGJ: Crystal structure of an IdeR-DNA complex reveals a conformational change in activated IdeR for base-specific interactions. Journal of Molecular Biology 2004, 342(4):1155-1169. 10.1016/j.jmb.2004.07.083
- Duncan R, Bazar L, Michelotti G, Tomonaga T, Krutzsch H, Avigan M, Levens D: A sequence-specific, single-strand binding protein activates the far upstream element of c-myc and defines a new DNA-binding motif. Genes and Development 1994, 8(4):465-480. 10.1101/gad.8.4.465
- Finocchiaro LME, Amati P, Glikin GC: Single strand binding protein specific for the polyoma early-coding strand of PEA1 (AP1) regulatory sequence. Nucleic Acids Research 1991, 19(15):4279-4287. 10.1093/nar/19.15.4279
- Heimberger AB, McGary EC, Suki D, Ruiz M, Wang H, Fuller GN, Bar-Eli M: Loss of the AP-2 α transcription factor is associated with the grade of human gliomas. Clinical Cancer Research 2005, 11(1):267-272.
- Christy B, Nathans D: DNA binding site of the growth factor-inducible protein Zif268. Proceedings of the National Academy of Sciences of the United States of America 1989, 86(22):8737-8741. 10.1073/pnas.86.22.8737
- Walsh K, Gualberto A: MyoD binds to the guanine tetrad nucleic acid structure. Journal of Biological Chemistry 1992, 267(19):13714-13718.
- Sabourin LA, Rudnicki MA: The molecular regulation of myogenesis. Clinical Genetics 2000, 57(1):16-25.
- Perkins KJ, Burton EA, Davies KE: The role of basal and myogenic factors in the transcriptional activation of utrophin promoter A: implications for therapeutic up-regulation in Duchenne muscular dystrophy. Nucleic Acids Research 2001, 29(23):4843-4850. 10.1093/nar/29.23.4843
- Chen H, Li B, Workman JL: A histone-binding protein, nucleoplasmin, stimulates transcription factor binding to nucleosomes and factor-induced nucleosome disassembly. EMBO Journal 1994, 13(2):380-390.
- Li Q, Wrange O: Translational positioning of a nucleosomal glucocorticoid response element modulates glucocorticoid receptor affinity. Genes and Development 1993, 7(12A):2471-2482. 10.1101/gad.7.12a.2471
- Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C: A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics 2007, 39(10):1235-1244. 10.1038/ng2117
- Schones DE, Cui K, Cuddapah S, Roh T-Y, Barski A, Wang Z, Wei G, Zhao K: Dynamic regulation of nucleosome positioning in the human genome. Cell 2008, 132(5):887-898. 10.1016/j.cell.2008.02.022
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.