Novel methodologies for spectral classification of exon and intron sequences
- Hon Keung Kwan^{1} (corresponding author),
- Benjamin Y M Kwan^{2} and
- Jennifer Y Y Kwan^{3}
https://doi.org/10.1186/1687-6180-2012-50
© Kwan et al; licensee Springer. 2012
- Received: 6 October 2011
- Accepted: 28 February 2012
- Published: 28 February 2012
Abstract
Digital processing of a nucleotide sequence requires it to be mapped to a numerical sequence in which the choice of nucleotide to numeric mapping affects how well its biological properties can be preserved and reflected from nucleotide domain to numerical domain. Digital spectral analysis of nucleotide sequences unfolds a period-3 power spectral value which is more prominent in an exon sequence as compared to that of an intron sequence. The success of a period-3 based exon and intron classification depends on the choice of a threshold value. The main purposes of this article are to introduce novel codes for 1-sequence numerical representations for spectral analysis and compare them to existing codes to determine appropriate representation, and to introduce novel thresholding methods for more accurate period-3 based exon and intron classification of an unknown sequence. The main findings of this study are summarized as follows: Among sixteen 1-sequence numerical representations, the K-Quaternary Code I offers an attractive performance. A windowed 1-sequence numerical representation (with window length of 9, 15, and 24 bases) offers a possible speed gain over non-windowed 4-sequence Voss representation which increases as sequence length increases. A winner threshold value (chosen from the best among two defined threshold values and one other threshold value) offers a top precision for classifying an unknown sequence of specified fixed lengths. An interpolated winner threshold value applicable to an unknown and arbitrary length sequence can be estimated from the winner threshold values of fixed length sequences with a comparable performance. In general, precision increases as sequence length increases. The study contributes an effective spectral analysis of nucleotide sequences to better reveal embedded properties, and has potential applications in improved genome annotation.
Keywords
- DNA sequence
- numerical representation
- nucleotide to numeric mapping
- exon and intron sequences
- coding and non-coding sequences
- threshold value
- thresholding
- exon and intron classification
- period-3
- spectral analysis
- discrete Fourier transform
- gene detection
- genome annotation
1. Introduction
It is known that the coding regions of a nucleotide sequence exhibit a period-3 spectral property because of the presence of codon structure [1], and this property can be used to identify regions of interest. A nucleotide sequence is a discrete sequence consisting of the nucleotides C, G, A, and T, to which digital spectral analysis can be applied to reveal its hidden periodicities, spectral features, and genome structure. Digital spectral analysis is usually carried out by the discrete Fourier transform (DFT), which is a digital signal processing (DSP) technique. Genomic signal processing (GSP) involves the processing and analysis of a digital signal in the form of a numerical sequence mapped from a nucleotide sequence [1, 2]. A general description of the biological aspects of this article can be found in [1, 2]. To perform digital spectral analysis, each nucleotide of a nucleotide sequence has to be converted to a numerical value through a mapping. Such a mapping is called a numerical representation, and its choice affects how well the biological properties of a nucleotide sequence can be revealed in the numerical domain. A nucleotide sequence can be numerically represented in the form of an R-sequence; a survey of mappings with R ≥ 1 is given in [3, 4]. The 1-sequence numerical representation (with R = 1) is the most compact form of mapping, in which each nucleotide is mapped to one fixed numerical value to form a single sequence. In this article, 1-sequence numerical representations and their relative performances are studied with the aim of classifying unknown exon and intron sequences with improved accuracy.
Nine 1-sequence numerical representations (Codes 1-9) [5–16] are identified through a literature search; they can be grouped under positive-integer-value, real-value, and complex-value numerical representations. In addition, seven 1-sequence complex-value numerical representations (Codes 10-16) are derived. All sixteen numerical representations and the Voss representation [17] are compared based on the genomes of twelve organisms (including the human).
Table 1. List of symbols
Symbol | Description | Symbol | Description |
---|---|---|---|
C | Cytosine | G | Guanine |
A | Adenine | T | Thymine |
DNA | Deoxyribonucleic acid | OG | Organism |
DSP | Digital signal processing | GSP | Genomic signal processing |
DFT | Discrete Fourier transform | FFT | Fast Fourier transform |
SL | Sequence length (bases) | WL | Window length (bases) |
WRS | Window right-shift (bases) | NW | Number of windows |
CDP3 | Cumulative distribution period-3 | CCDP3 | Complementary CDP3 |
PSV | Power spectral value | P _{3} | Period-3 power spectral value |
T _{ m } | Mid threshold value | T _{ p } | Proportional threshold value |
T _{ c } | Cumulative distribution threshold value | T _{ 4 } | Fixed threshold value (= 4.0) |
T _{ w } | Winner threshold value | T _{ i } | Interpolated winner threshold value |
meanP _{ 3e } | Mean of the period-3 values obtained from specified exon sequences | ||
sdP _{ 3e } | Standard deviation of the period-3 values obtained from specified exon sequences | ||
meanP _{ 3i } | Mean of the period-3 values obtained from specified intron sequences | ||
sdP _{ 3i } | Standard deviation of the period-3 values obtained from specified intron sequences |
2. Methods and results
2.1. Numerical representation
Table 2. List of sixteen numerical representation codes
Code | Name | C | G | A | T |
---|---|---|---|---|---|
1 | Integer Number | 1 | 3 | 2 | 0 |
2 | Single Galois Indicator | 1 | 3 | 0 | 2 |
3 | Paired Nucleotide Atomic Number | 42 | 62 | 62 | 42 |
4 | Atomic Number | 58 | 78 | 70 | 66 |
5 | Molecular Mass | 110 | 150 | 134 | 125 |
6 | EIIP | 0.1340 | 0.0806 | 0.1260 | 0.1335 |
7 | Paired Numeric | -1 | -1 | 1 | 1 |
8 | Real Number | 0.5 | -0.5 | -1.5 | 1.5 |
9 | Complex Number | -1-j | -1+j | 1+j | 1-j |
10 | K-Twin-Pair Code | -1 | -1 | j | j |
11 | K-Bipolar-Pair Code I | -1 | 1 | j | -j |
12 | K-Bipolar-Pair Code II | -1 | 1 | -j | j |
13 | K-Quaternary Code I | -1 | -j | 1 | j |
14 | K-Quaternary Code II | -1 | -j | j | 1 |
15 | K-Quaternary Code III | -j | -1 | 1 | j |
16 | K-Quaternary Code IV | -j | -1 | j | 1 |
It is known that exons are rich in nucleotides C and G and introns are rich in nucleotides A and T [8, 19]. For ease of viewing, C-G and A-T are paired when defining each of the sixteen numerical representations. The first group consists of five positive-integer-value numerical representations (Codes 1-5 in Table 2), described as follows. The Integer Number representation [5, 6] maps the numerals {1, 3, 2, 0} to the four nucleotides as C = 1, G = 3, A = 2, and T = 0. The Single Galois Indicator representation [7, 8] maps the nucleotides to the Galois field of four elements, GF(4), by assigning C = 1, G = 3, A = 0, and T = 2. This representation suggests that C < G and A < T. In the Paired Nucleotide Atomic Number representation [9], the nucleotides are paired and assigned the atomic numbers A, G = 62 and C, T = 42. In the Atomic Number representation [9], a numerical sequence is formed by assigning an atomic number to each nucleotide as C = 58, G = 78, A = 70, and T = 66. The Molecular Mass representation [10] maps the four nucleotides to their molecular masses as C = 110, G = 150, A = 134, and T = 125.
The second group consists of three real-value numerical representations (Codes 6-8 in Table 2), described as follows. The Electron-Ion Interaction Pseudo-potential (EIIP) represents the distribution of the free electron energies along a nucleotide sequence. In the EIIP representation, a single EIIP indicator sequence [11, 12] is formed by substituting the EIIP values of the nucleotides, C = 0.1340, G = 0.0806, A = 0.1260, and T = 0.1335, into a nucleotide sequence. In the Paired Numeric representation [6, 8], the nucleotides are paired in a complementary manner (C-G, A-T), and the values -1 and +1 denote the C-G and A-T nucleotide pairs, respectively. In the Real Number representation [6, 8, 13], the nucleotide mappings are C = 0.5, G = -0.5, A = -1.5, and T = 1.5, which bear a complementary property. The third group consists of one complex-value numerical representation (Code 9 in Table 2), the Complex Number representation [2, 5, 6, 8, 14–16], which reflects the complementary nature of the C-G and A-T pairs by mapping the nucleotides as C = -1-j, G = -1+j, A = 1+j, and T = 1-j.
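As an illustration of how a 1-sequence mapping is applied, the following minimal Python sketch encodes a nucleotide string with three of the codes listed in Table 2 (one from each group). The names `CODES` and `encode` are hypothetical and are not taken from the authors' Matlab programs; only the numerical values come from the table.

```python
# Three of the 1-sequence codes from Table 2 (values copied from the table).
CODES = {
    "Integer Number": {"C": 1, "G": 3, "A": 2, "T": 0},
    "EIIP": {"C": 0.1340, "G": 0.0806, "A": 0.1260, "T": 0.1335},
    "Complex Number": {"C": -1 - 1j, "G": -1 + 1j, "A": 1 + 1j, "T": 1 - 1j},
}


def encode(sequence, code_name):
    """Map each nucleotide of `sequence` to one fixed numerical value (R = 1)."""
    table = CODES[code_name]
    # A KeyError is raised for any symbol other than C, G, A, or T.
    return [table[base] for base in sequence.upper()]


if __name__ == "__main__":
    print(encode("CGAT", "Integer Number"))
    print(encode("CGAT", "Complex Number"))
```

Each nucleotide yields exactly one number, so a sequence of N bases becomes a single numerical sequence of length N.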
In addition to the above nine 1-sequence numerical representations, seven 1-sequence numerical representations (listed as Codes 10-16 in Table 2) are derived in which each nucleotide of a sequence is mapped to either a single real-value element (±1) or a single imaginary-value element (±j). Each of Codes 10-16, namely the K-Twin-Pair Code, the K-Bipolar-Pair Codes I and II, and the K-Quaternary Codes I-IV, has equivalent numerical representations. We define a numerical representation R_{2} to be an equivalent numerical representation of R_{1} if R_{2} gives rise to the same power spectrum as R_{1}. An equivalent numerical representation can be obtained by multiplying the numerically represented value of each of the four bases [C, G, A, T] by the same constant, which can be any of -1 and ±j. In particular, a complementary numerical representation, obtained by inverting the signs of all four bases [C, G, A, T], is an equivalent numerical representation. Other forms of equivalent numerical representation exist; for example, the K-Twin-Pair Code [C, G, A, T] = [-1, -1, j, j] has an equivalent form [C, G, A, T] = [-j, -j, 1, 1]. The K-Quaternary Code III (Code 15) [C, G, A, T] = [-j, -1, 1, j] is identical to the pentanary code in [20] (where x[n] = 0 is specified for an unknown nucleotide at position n), which was derived for the three-dimensional DNA walk for graphical representation of nucleotide sequences. The K-Bipolar-Pair Code II (Code 12) [C, G, A, T] = [-1, 1, -j, j] has an equivalent numerical representation of [C, G, A, T] = [-j, j, 1, -1] adopted in [21]. The K-Quaternary Code IV (Code 16) [C, G, A, T] = [-j, -1, j, 1] has a complementary numerical representation of [C, G, A, T] = [j, 1, -j, -1] mentioned in [22].
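The equivalence property can be checked numerically. The sketch below is an illustration only: it assumes the power spectrum is the squared magnitude of the DFT, uses an arbitrary made-up test sequence, and introduces the hypothetical helper names `power_spectrum`, `encode`, and `is_equivalent`. It multiplies the K-Quaternary Code I by a constant and compares the resulting power spectra.

```python
import cmath

# K-Quaternary Code I (Code 13) from Table 2.
K_QUATERNARY_I = {"C": -1, "G": -1j, "A": 1, "T": 1j}


def encode(sequence, code, scale=1):
    """Encode a sequence, multiplying every mapped value by the same constant."""
    return [scale * code[base] for base in sequence]


def power_spectrum(x):
    """Squared magnitude of the N-point DFT of x (direct O(N^2) evaluation)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n)
    ]


def is_equivalent(scale, sequence="ATGGCCATTGTAATGGGCCGC", tol=1e-6):
    """True if scaling the code by `scale` leaves the power spectrum unchanged."""
    base = power_spectrum(encode(sequence, K_QUATERNARY_I))
    other = power_spectrum(encode(sequence, K_QUATERNARY_I, scale))
    return all(abs(a - b) < tol for a, b in zip(base, other))


if __name__ == "__main__":
    for scale in (-1, 1j, -1j, 2):
        print(scale, is_equivalent(scale))
```

The constants -1 and ±j have unit magnitude, so they change only the phase of each DFT coefficient and leave its squared magnitude unchanged; a constant such as 2 scales the power spectrum and therefore does not give an equivalent representation in this sense.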
In [23], complex numerical representations with simultaneous non-zero real part and non-zero imaginary part were adopted which are different from those of the Codes 10-16 in which either a real value or an imaginary value is used for each nucleotide to numeric mapping. It should be noted that the numerical representations in [21–23] were formulated for sequence alignment but not for period-3 power spectral analysis.
The Voss representation (i.e., Code 17) is a commonly used numerical representation for period-3 spectral classification of exon and intron sequences [1, 2, 17, 24, 25]. The Voss representation is a 4-sequence representation (R = 4) in which each of the four nucleotides of a nucleotide sequence is represented by a separate numerical sequence (the C-sequence, G-sequence, A-sequence, and T-sequence) such that the n^{th} position of the C-sequence is coded by 1 if the n^{th} nucleotide of the sequence is C, and by 0 otherwise. The same coding procedure applies to the remaining three numerical sequences. A threshold value can be determined from the cumulative distributions of the signal-to-noise ratio of the peak at f = 1/3 for each set of exon and intron sequences [17]. Using the Voss representation, two known threshold values (T_{ c }, T_{4}) are adopted for comparison in this article. The T_{ c } threshold is determined from the exon and intron cumulative distribution functions [17], and the T_{4} threshold is set to a fixed value of four [17]. Codes 1-16 are compared to the Voss representation (Code 17) in Section 2.4.
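A minimal sketch of the Voss coding is given below. The indicator sequences follow the description above; the period-3 measure is an assumption made for illustration (the sum over the four indicator sequences of the squared magnitude of the Fourier sum evaluated at angular frequency 2π/3) and may differ in normalisation from the measure used in [17] or in this article. The function names are hypothetical.

```python
import cmath


def voss_indicators(sequence):
    """Return the four binary indicator sequences (C, G, A, T) of a sequence."""
    return {
        base: [1 if symbol == base else 0 for symbol in sequence]
        for base in "CGAT"
    }


def fourier_sum(x, omega):
    """Evaluate sum_t x[t] * exp(-j * omega * t) at one angular frequency."""
    return sum(value * cmath.exp(-1j * omega * t) for t, value in enumerate(x))


def voss_period3_power(sequence):
    """Assumed period-3 measure: total squared magnitude at omega = 2*pi/3."""
    omega = 2.0 * cmath.pi / 3.0
    return sum(
        abs(fourier_sum(x, omega)) ** 2
        for x in voss_indicators(sequence).values()
    )


if __name__ == "__main__":
    print(voss_indicators("CGAT")["C"])
    print(voss_period3_power("ATGATGATGATG"))  # strongly period-3 sequence
    print(voss_period3_power("AAAAAAAAAAAA"))  # no period-3 structure
```

Because four indicator sequences must be transformed instead of one, the Voss representation requires roughly four times the transform work of a 1-sequence code applied to the same window.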
2.2. Spectral analysis
2.3. Winner threshold value
In both definitions, the exon cluster is centred at meanP_{ 3e } and the intron cluster is centred at meanP_{ 3i } . In Equation 8, the mid threshold value is determined by the mid-point between the exon cluster and the intron cluster, whereas in Equation 9, the proportional threshold value is determined in proportion to the standard deviations of the two clusters.
We define a winner threshold value, T_{ w } , as the threshold value chosen among T_{ m }, T_{ p } , and T_{ c } that yields the top classification (or the highest precision).
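The thresholding step can be sketched as follows. Equations 8 and 9 are not reproduced in this text, so the formulas below are assumptions: the mid threshold is taken as the mid-point of the two cluster means, as described above, and the proportional threshold is taken as the point whose distance from each cluster mean is proportional to that cluster's standard deviation, which is one plausible reading of the description. Precision is assumed here to be the percentage of sequences classified correctly when a sequence is labelled an exon if its period-3 value exceeds the threshold. All function names and the sample values are hypothetical.

```python
from statistics import mean, stdev


def mid_threshold(p3_exon, p3_intron):
    """Mid-point between the exon and intron cluster means."""
    return (mean(p3_exon) + mean(p3_intron)) / 2.0


def proportional_threshold(p3_exon, p3_intron):
    """ASSUMED form: (T - mean_i) / sd_i == (mean_e - T) / sd_e."""
    mean_e, mean_i = mean(p3_exon), mean(p3_intron)
    sd_e, sd_i = stdev(p3_exon), stdev(p3_intron)
    return (sd_i * mean_e + sd_e * mean_i) / (sd_e + sd_i)


def precision(threshold, p3_exon, p3_intron):
    """ASSUMED metric: percentage of sequences classified correctly."""
    correct = sum(p > threshold for p in p3_exon)
    correct += sum(p <= threshold for p in p3_intron)
    return 100.0 * correct / (len(p3_exon) + len(p3_intron))


def winner_threshold(candidates, p3_exon, p3_intron):
    """Return (name, value) of the candidate threshold with the highest precision."""
    return max(
        candidates.items(),
        key=lambda item: precision(item[1], p3_exon, p3_intron),
    )


if __name__ == "__main__":
    exon = [4.0, 6.0, 8.0]    # made-up period-3 values, for illustration only
    intron = [1.0, 2.0, 3.0]  # made-up period-3 values, for illustration only
    candidates = {
        "T_m": mid_threshold(exon, intron),
        "T_p": proportional_threshold(exon, intron),
    }
    print(candidates, winner_threshold(candidates, exon, intron))
```

In the article the candidate set also contains T_{ c }, and the threshold values are computed from a training portion of each sequence set; a value for T_{ c } could be added to the `candidates` dictionary in the same way.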
2.4. Classification and speed performances of codes 1-17
In this article, all the exon and intron sequences of the human and eleven model organisms were downloaded from the UCSC Genomes [26–29] as inputs to be processed by Matlab programs.
2.4.1. Classification performance of short sequences
Table 3. UCSC Human genome consisting of two short sequence groups
Short sequence group I | Short sequence group II | ||||
---|---|---|---|---|---|
SL | Type | Number | SL | Type | Number |
50 | Exon | 542910 | 90 | Exon | 4705 |
50 | Intron | 653640 | 90 | Intron | 1495 |
100 | Exon | 379835 | 120 | Exon | 4582 |
100 | Intron | 621266 | 120 | Intron | 723 |
150 | Exon | 195133 | 180 | Exon | 2355 |
150 | Intron | 588070 | 180 | Intron | 427 |
200 | Exon | 90211 | 210 | Exon | 1188 |
200 | Intron | 567031 | 210 | Intron | 436 |
250 | Exon | 51626 | 270 | Exon | 612 |
250 | Intron | 548406 | 270 | Intron | 271 |
300 | Exon | 35685 | 330 | Exon | 147 |
300 | Intron | 532489 | 330 | Intron | 322 |
350 | Exon | 28497 | 390 | Exon | 122 |
350 | Intron | 519074 | 390 | Intron | 242 |
400 | Exon | 24007 | 420 | Exon | 126 |
400 | Intron | 506839 | 420 | Intron | 240 |
450 | Exon | 20957 | 480 | Exon | 65 |
450 | Intron | 495007 | 480 | Intron | 263 |
500 | Exon | 18492 | 510 | Exon | 54 |
500 | Intron | 483412 | 510 | Intron | 272 |
550 | Exon | 16605 | 570 | Exon | 75 |
550 | Intron | 472689 | 570 | Intron | 214 |
600 | Exon | 14822 | 630 | Exon | 25 |
600 | Intron | 462572 | 630 | Intron | 189 |
650 | Exon | 13340 | | | |
650 | Intron | 452529 | | | |
Table 4. Code (Figure 6b, bottom), WL (bases) (Figure 7a, top), threshold method (Figure 7b, middle), and precision (%) (Figure 6a, top) of top classifications of 13 short sequence sets
SL | Code | WL | Threshold | Precision |
---|---|---|---|---|
50 | 17 | 9 | T _{ c } | 67.5412 |
50 | 13 | 9 | T _{ p } | - |
100 | 13 | 9 | T _{ p } | 73.2234 |
150 | 17 | 150 | T _{ c } | 76.8066 |
150 | 10 | 150 | T _{ m } | - |
200 | 17 | 200 | T _{ c } | 77.8411 |
200 | 13 | 9 | T _{ c } | - |
250 | 13 | 9 | T _{ p } | 79.1004 |
300 | 17 | 300 | T _{ c } | 83.8231 |
300 | 13 | 300 | T _{ p } | - |
350 | 13 | 9 | T _{ p } | 84.1829 |
400 | 13 | 9 | T _{ p } | 84.2579 |
450 | 13 | 15 | T _{ p } | 84.3628 |
500 | 13 | 9 | T _{ c } | 86.2519 |
550 | 17 | 550 | T _{ c } | 89.1604 |
550 | 13 | 9 | T _{ c } | - |
600 | 13 | 9 | T _{ c } | 89.7151 |
650 | 13 | 15 | T _{ c } | 91.2744 |
2.4.2. Classification performance of long sequences
Table 5. UCSC Human genome consisting of two long sequence groups
Long sequence group I | Long sequence group II | ||||
---|---|---|---|---|---|
SL | Type | Number | SL | Type | Number |
650 | Exon | 13340 | 690 | Exon | 35 |
650 | Intron | 452529 | 690 | Intron | 178 |
700 | Exon | 12103 | 720 | Exon | 25 |
700 | Intron | 443645 | 720 | Intron | 205 |
750 | Exon | 11075 | 780 | Exon | 37 |
750 | Intron | 434145 | 780 | Intron | 191 |
800 | Exon | 10132 | 810 | Exon | 26 |
800 | Intron | 425427 | 810 | Intron | 125 |
850 | Exon | 9312 | 870 | Exon | 5 |
850 | Intron | 417108 | 870 | Intron | 119 |
900 | Exon | 8587 | 930 | Exon | 42 |
900 | Intron | 409022 | 930 | Intron | 177 |
950 | Exon | 7608 | 990 | Exon | 24 |
950 | Intron | 401490 | 990 | Intron | 110 |
1000 | Exon | 6846 | 1020 | Exon | 14 |
1000 | Intron | 393725 | 1020 | Intron | 217 |
1050 | Exon | 6340 | 1080 | Exon | 31 |
1050 | Intron | 385722 | 1080 | Intron | 143 |
1100 | Exon | 5726 | 1110 | Exon | 12 |
1100 | Intron | 378374 | 1110 | Intron | 160 |
1150 | Exon | 5276 | 1170 | Exon | 12 |
1150 | Intron | 371372 | 1170 | Intron | 174 |
1200 | Exon | 4953 | 1230 | Exon | 14 |
1200 | Intron | 364598 | 1230 | Intron | 106 |
1250 | Exon | 4583 | | | |
1250 | Intron | 357831 | | | |
Table 6. Code (Figure 8b, bottom), WL (bases) (Figure 9a, top), threshold method (Figure 9b, middle), and precision (%) (Figure 8a, top) of top classifications of 13 long sequence sets
SL | Code | WL | Threshold | Precision |
---|---|---|---|---|
650 | 13 | 15 | T _{ c } | 90.1396 |
700 | 13 | 9 | T _{ c } | 90.4014 |
750 | 13 | 24 | T _{ 4 } | 91.2304 |
800 | 13 | 15 | T _{ 4 } | 92.8447 |
850 | 13 | 15 | T _{ 4 } | 93.2810 |
900 | 13 | 9 | T _{ p } | 92.4084 |
950 | 13 | 9 | T _{ p } | 92.4520 |
1000 | 13 | 24 | T _{ p } | 91.7103 |
1050 | 13 | 15 | T _{ p } | 92.1902 |
1100 | 1 | 24 | T _{ 4 } | 93.5428 |
1100 | 13 | 9 | T _{ p } | - |
1150 | 13 | 9 | T _{ p } | 94.6771 |
1200 | 13 | 1200 | T _{ c } | 94.9825 |
1250 | 13 | 9 | T _{ p } | 94.9825 |
2.4.3. Speed performance
2.5. Interpolated winner threshold value
2.5.1. Short sequences
2.5.2. Long sequences
2.6. Classification performances of 12 organisms
Table 7. UCSC genome of 12 organisms
OG | Clade | Genome | Assembly | Track | Table | Exons | Introns |
---|---|---|---|---|---|---|---|
1 | Mammal | Human | Feb. 2009 (GRCh37/hg19) | UCSC Genes | knownGene | 35685 | 532489 |
2 | Mammal | Mouse | July 2007 (NCBI37/mm9) | UCSC Genes | knownGene | 27452 | 43041 |
3 | Mammal | Pig | Nov. 2009 (SGSC Sscrofa9.2/susScr2) | Ensembl Genes | ensGene | 11499 | 113946 |
4 | Vertebrate | Chicken | May 2006 (WUGSC 2.1/galGal3) | Ensembl Genes | ensGene | 12045 | 57770 |
5 | Vertebrate | Zebrafish | Jul. 2010 (Zv9/danRer7) | RefSeq Genes | refGene | 7838 | 78776 |
6 | Deuterostome | C. intestinalis | Mar. 2005 (JGI 2.1/ci2) | Ensembl Genes | ensGene | 7360 | 86885 |
7 | Deuterostome | S. purpuratus | Sep. 2006 (Baylor 2.1/strPur2) | Other RefSeq | xenoRefGene | 4888 | 193451 |
8 | Insect | D. melanogaster | Apr. 2006 (BDGP R5/dm3) | RefSeq Genes | refGene | 37925 | 35403 |
9 | Insect | A. mellifera | Jan. 2005 (Baylor 2.0/apiMel2) | Ensembl Genes | ensGene | 21827 | 47605 |
10 | Nematode | C. elegans | May 2008 (WS190/ce6) | RefSeq Genes | refGene | 25360 | 37359 |
11 | Nematode | C. japonica | Mar. 2008 (WUGSC 3.0.2/caeJap1) | Other RefSeq | xenoRefGene | 7978 | 30958 |
12 | Other | Sea hare | Sept. 2008 (Broad 2.0/aplCal1) | Other RefSeq | xenoRefGene | 6914 | 431792 |
Table 8. Code index (Figure 17b, bottom), WL (bases) (Figure 18a, top), threshold method (Figure 18b, middle), and precision (%) (Figure 17a, top) of top classifications of 12 organisms (SL = 300 bases)
OG | Code | WL | Threshold | Precision |
---|---|---|---|---|
1 | 13 | 24 | T _{ p } | 80.8281 |
2 | 13 | 300 | T _{ m } | 82.9433 |
3 | 13 | 24 | T _{ p } | 83.6184 |
4 | 13 | 24 | T _{ p } | 80.1530 |
5 | 17 | 300 | T _{ c } | 83.3933 |
5 | 13 | 300 | T _{ p } | - |
6 | 6 | 300 | T _{ p } | 82.6283 |
7 | 13 | 24 | T _{ 4 } | 94.9145 |
8 | 17 | 300 | T _{ c } | 93.2943 |
8 | 13 | 9 | T _{ p } | - |
9 | 17 | 300 | T _{ c } | 88.7039 |
9 | 10 | 15 | T _{ c } | - |
10 | 17 | 300 | T _{ c } | 87.3087 |
11 | 13 | 15 | T _{ c } | 82.3132 |
12 | 13 | 15 | T _{ c } | 91.8992 |
3. Discussion
In this article, Figures 3 and 4 show the ability of the K-Quaternary Code I (Code 13), through the use of the discrete Fourier transform, to capture the periodicities of an exon sequence or an intron sequence across the entire spectrum at a resolution defined by the window length. This spectral tracking ability is shared among all sixteen numerical representations. As seen from Figures 3 and 4, there are three prominent peaks, located at frequencies 0, 2π/3, and 4π/3. The peak at 2π/3 corresponds to the period-3 power spectral value, and the peak at 0 corresponds to the power spectral value at DC, which usually exhibits the highest value within a spectrum. The power spectral values at other frequencies are lower and different, but all serve to reflect the actual power spectral properties across the spectrum. For Codes 9-16, the numerically represented sequences x[n] are complex; therefore, the DFT spectrum shows non-symmetrical peaks at frequencies 2π/3 and 4π/3. For Codes 1-8, however, the numerically represented sequences x[n] are real, and therefore symmetrical peaks at frequencies 2π/3 and 4π/3 are obtained from DFT analysis.
The interpolated winner threshold values, T_{ i } (P_{3}), and their corresponding precisions, shown in Figures 13 and 14 for short sequences and in Figures 15 and 16 for long sequences of the human genome, indicate that T_{ i } (P_{3}) obtained using cubic spline interpolation from either Figure 11 or Figure 12 can yield a precision similar to that of the winner threshold value, T_{ w } (P_{3}), of its nearby SL. It should be noted that each of T_{ m } (P_{3}), T_{ p } (P_{3}), and T_{ c } (P_{3}) required to determine T_{ w } (P_{3}) has to be computed directly from the training portion of each short/long sequence set, whereas T_{ i } (P_{3}) requires minimal computation but can achieve a comparable precision.
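The symmetry property of real and complex codes can be illustrated with a short sketch. It evaluates the squared magnitude of the Fourier sum at 2π/3 and 4π/3 for one real code (the Integer Number code, Code 1) and one complex code (the K-Quaternary Code I, Code 13) on a made-up 12-base sequence. The helper names `power_at` and `peak_pair` are hypothetical, and the unnormalised squared magnitude is an assumption about how the power spectral value is defined.

```python
import cmath

INTEGER_NUMBER = {"C": 1, "G": 3, "A": 2, "T": 0}       # Code 1 (real)
K_QUATERNARY_I = {"C": -1, "G": -1j, "A": 1, "T": 1j}   # Code 13 (complex)


def power_at(x, omega):
    """Squared magnitude of sum_t x[t] * exp(-j * omega * t)."""
    total = sum(value * cmath.exp(-1j * omega * t) for t, value in enumerate(x))
    return abs(total) ** 2


def peak_pair(sequence, code):
    """Return the power values at 2*pi/3 and 4*pi/3 for an encoded sequence."""
    x = [code[base] for base in sequence]
    return (
        power_at(x, 2.0 * cmath.pi / 3.0),
        power_at(x, 4.0 * cmath.pi / 3.0),
    )


if __name__ == "__main__":
    seq = "ATGATGATGATG"  # made-up sequence with an exact period of 3
    print("Code 1 :", peak_pair(seq, INTEGER_NUMBER))
    print("Code 13:", peak_pair(seq, K_QUATERNARY_I))
```

For a real-valued sequence the Fourier sum at 4π/3 is the complex conjugate of the sum at 2π/3, so the two power values coincide; for a complex-valued sequence this conjugate relation does not hold in general, and the two peaks can differ.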
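The interpolation step can be sketched with a natural cubic spline. The article states only that cubic spline interpolation is used; the natural boundary condition, the function name `natural_cubic_spline`, and the threshold values in the usage example are assumptions introduced here for illustration. The numbers are placeholders and are not results from this study.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing and contain at least two points.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3.0 * (ys[i + 1] - ys[i]) / h[i]
                    - 3.0 * (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal system for the second-order coefficients.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] and c[n] stay zero (natural boundary condition).
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Pick the interval containing x, clamping to the first or last interval.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


if __name__ == "__main__":
    sequence_lengths = [50, 100, 150, 200, 250]   # SL values with known T_w
    winner_values = [1.8, 2.1, 2.6, 2.9, 3.3]     # PLACEHOLDER thresholds only
    t_i = natural_cubic_spline(sequence_lengths, winner_values)
    print(t_i(120))  # interpolated threshold for a 120-base sequence
```

Once the spline has been fitted to the winner threshold values of the fixed-length sets, evaluating it for a new length costs only a few arithmetic operations, which is consistent with the remark above that T_{ i } requires minimal computation.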
Given an unknown human nucleotide sequence of an arbitrary length L, if L is equal to the length of any of the thirteen short sequence sets or any of the thirteen long sequence sets, the choice of a suitable set of code, WL, and threshold can be obtained as a table-look-up from either Table 4 or 6. If L is not equal to the length of any of these thirteen short or long sequence sets, the closest SL, its code, and WL can also be obtained from either Table 4 or 6. For the latter case, once the code and WL are determined, its threshold value can then be determined using the interpolated winner threshold described in Section 2.5. Besides the human genome, the methodologies described in this article can be applied to the genome of other model organisms as verified by the results shown in Figures 17(a-b) and 18 (a-c) and Tables 7 and 8.
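The table look-up described above can be sketched as follows. The entries are the first-listed row for each SL in the short-sequence table of top classifications (Table 4 in the text); where that table lists two tied rows for the same SL, only the first is kept here, which is a simplification made for brevity. The names `TABLE_4` and `closest_entry` are hypothetical.

```python
# (SL, code, WL, threshold method) from the short-sequence top classifications.
TABLE_4 = [
    (50, 17, 9, "T_c"), (100, 13, 9, "T_p"), (150, 17, 150, "T_c"),
    (200, 17, 200, "T_c"), (250, 13, 9, "T_p"), (300, 17, 300, "T_c"),
    (350, 13, 9, "T_p"), (400, 13, 9, "T_p"), (450, 13, 15, "T_p"),
    (500, 13, 9, "T_c"), (550, 17, 550, "T_c"), (600, 13, 9, "T_c"),
    (650, 13, 15, "T_c"),
]


def closest_entry(length):
    """Return the table row whose SL is nearest to the given sequence length.

    When two SL values are equally close, the smaller SL is returned.
    """
    return min(TABLE_4, key=lambda row: abs(row[0] - length))


if __name__ == "__main__":
    sl, code, wl, method = closest_entry(430)
    print(f"closest SL = {sl}, code = {code}, WL = {wl}, threshold = {method}")
```

For a length that matches a tabulated SL the row is used directly; otherwise the code and WL of the closest SL are used and the threshold value itself is obtained by interpolation, as described in Section 2.5.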
4. Conclusions
In this article, two methods for determining threshold values have been defined, and together with the cumulative distribution threshold value, determine the winner threshold value for classifying an unknown nucleotide sequence of a fixed length. An interpolated winner threshold value has also been introduced to classify an unknown nucleotide sequence of an arbitrary length with a comparable performance to that obtained by the winner threshold value of its nearby SL (in classifying an unknown nucleotide sequence of a fixed length). In general, precision increases as sequence length increases, and classification performance depends on a suitable choice of numerical representation and window length. Sixteen 1-sequence numerical representations have been presented and compared to classify untrained exon and intron sequences in the spectral domain, in which the K-Quaternary Code I yields attractive performance. When comparing each of the sixteen windowed 1-sequence numerical representations using WL = [9,15,24] bases to a non-windowed 4-sequence numerical representation (such as the Voss representation), the speed improvement ratio increases as SL increases which favours long nucleotide sequence analysis. The results obtained indicate the methodologies introduced in this article for exon and intron sequence classification are applicable to the genomes of the human and other model organisms. Overall, the study has developed novel methodologies in numerical representation for improved nucleotide to numeric mapping, spectral analysis for effective period-3 spectral value computation, and thresholding for more accurate classification of unknown exon and intron sequences of fixed and arbitrary length.
References
- Vaidyanathan PP: Genomics and proteomics: A signal processor's tour. IEEE Circ. Syst. Mag. 4th Q 2004, 6-29.
- Anastassiou D: Genomic signal processing. IEEE Signal Process. Mag. 2001, 18: 8-20.
- Kwan HK, Arniker SB: Numerical representation of DNA sequences. 307-310.
- Arniker SB, Kwan HK: Graphical representation of DNA sequences. 311-314.
- Cristea PD: in BIOS’02: Genetic Signal Representation and Analysis, SPIE International Conference on Biomedical Optics Symposium, Molecular Analysis and Informatics, San Jose, CA, USA, 21-24 January 2002. Proceedings of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference, vol. 4623 (SPIE, 2002). 77-84.
- Akhtar M, Epps J, Ambikairajah E: On DNA numerical representations for period-3 based exon prediction. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Tuusula, Finland, 10-12 June 2007.
- Rosen GL: Signal Processing for Biologically-inspired Gradient Source Localization and DNA Sequence Analysis. PhD dissertation, Georgia Institute of Technology, Atlanta; 2006.
- Akhtar M, Epps J, Ambikairajah E: Signal processing in sequence analysis: Advances in eukaryotic gene prediction. IEEE J Sel Top Signal Process 2008, 2: 310-321.
- Holden T, Subramaniam R, Sullivan R, Cheng E, Sneider C, Tremberger G Jr, Flamholz A, Leiberman DH, Cheung TD: ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes. In Instruments, Methods, and Missions for Astrobiology X. Proceedings of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference, vol. 6694 (SPIE, 12 September 2007) Edited by: B Hoover, GV Levin, AY Rozanov, and PCW Davies. 669417-1-669417-10.
- Stanley HE, Buldyrev SV, Goldberger AL, Goldberger ZD, Havlin S, Ossadnik SM, Peng C-K, Simmons M: Statistical mechanics in biology: how ubiquitous are long-range correlations? Physica A 1994, 205: 214-253. 10.1016/0378-4371(94)90502-9
- Nair AS, Pillai SS: A coding measure scheme employing electron-ion interaction pseudo potential (EIIP). Bioinformation 2006, 1: 197-202.
- Cosic I: Macromolecular Bioactivity: is it resonant interaction between macromolecules? Theory and applications. IEEE Trans Biomed Eng 1994, 41: 1101-1114. 10.1109/10.335859
- Chakravarthy N, Spanias A, Lasemidis LD, Tsakalis K: Autoregressive modeling and feature analysis of DNA sequences. EURASIP Journal of Genomic Signal Processing 2004, 1: 13-28.
- Cristea PD: Conversion of nucleotides sequences into genomic signals. J Cell Mol Med 2002, 6: 279-303. 10.1111/j.1582-4934.2002.tb00196.x
- Cristea PD: Representation and analysis of DNA sequences. In Genomic Signal Processing and Statistics, EURASIP Book Series in Signal Processing and Communications, volume 2 (Hindawi Publishing Corporation, 2005) Edited by: ER Dougherty, I Shmulevich, J Chen, ZJ Wang. 15-65.
- Brodzik AK, Peters O: Symbol-balanced Quaternionic periodicity transform for latent pattern detection in DNA sequences. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005 373-376.
- Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics (CABIOS) 1997, 13(3):263-270. 10.1093/bioinformatics/13.3.263
- Datta S, Asif A: A fast DFT based gene prediction algorithm for identification of protein coding regions. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005 653-656.
- Kwan BYM, Kwan JYY, Kwan HK, Atwal R, Shen OT: Wavelet analysis of the genome of the model plant Arabidopsis thaliana. 4 pages in Proceedings of TENCON, Hong Kong, China, 2006.
- Berger JA, Mitra SK, Carli M, Neri A: New approaches to genome sequence analysis based on digital signal processing. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Raleigh, North Carolina, October 2002.
- Cheever EA, Searls DB, Karunaratne W, Overton GC: Using signal processing techniques for DNA sequence comparison. 173-174.
- Rajasekaran S, Nick H, Pardalos PM, Sahni S, Shaw G: Efficient algorithms for local alignment search. J Comb Optim 2001, 5: 117-124. 10.1023/A:1009893719470
- Avenio GD, Grigioni M, Orefici G, Creti R: SWIFT (sequence-wide investigation with Fourier transform): a software tool for identifying proteins of a given class from the unannotated genome sequence. Bioinformatics 2005, 21(13):2943-2949. 10.1093/bioinformatics/bti468
- Yin CC, Yau SS-T: Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J Theor Biol 2007, 247: 687-694. 10.1016/j.jtbi.2007.03.038
- Tuqan J, Rushdi A: A DSP approach for finding the codon bias in DNA sequences. IEEE J Sel Topics Signal Process 2008, 2(3):343-356.
- Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004, (32, Database issue):D493-496. 10.1093/nar/gkh103
- Goecks J, Nekrutenko A, Taylor J, The Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 2010, 11(8): Article R86. 10.1186/gb-2010-11-8-r86
- Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 2010, Chapter 19(Unit 19.10):1-21.
- Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: a platform for interactive large-scale genome analysis. Genome Res 2005, 15(10):1451-1455. 10.1101/gr.4086505
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.