- Research Article
- Open Access

# An MCMC Algorithm for Target Estimation in Real-Time DNA Microarrays

- Haris Vikalo^{1} (corresponding author)
- Mahsuni Gokdemir^{1}

**2010**:736301

https://doi.org/10.1155/2010/736301

© H. Vikalo and M. Gokdemir. 2010

**Received:** 1 February 2010 **Accepted:** 15 July 2010 **Published:** 2 August 2010

## Abstract

DNA microarrays detect the presence and quantify the amounts of nucleic acid molecules of interest. They rely on a chemical attraction between the target molecules and their Watson-Crick complements, which serve as biological sensing elements (probes). The attraction between these biomolecules leads to binding, in which probes capture target analytes. Recently developed real-time DNA microarrays are capable of observing kinetics of the binding process. They collect noisy measurements of the amount of captured molecules at discrete points in time. Molecular binding is a random process which, in this paper, is modeled by a stochastic differential equation. The target analyte quantification is posed as a parameter estimation problem, and solved using a Markov Chain Monte Carlo technique. In simulation studies where we test the robustness with respect to the measurement noise, the proposed technique significantly outperforms previously proposed methods. Moreover, the proposed approach is tested and verified on experimental data.

## Keywords

- Markov Chain Monte Carlo
- Stochastic Differential Equation
- Gibbs Sampler
- Importance Sampler
- Transition Density

## 1. Introduction

Molecular biosensors [1] are devices that contain a biological sensing element closely coupled with a transducer. They measure the interaction of biomolecules of interest (*target analytes*) with the biological sensing element, and generate a signal proportional to the amount of the analyte molecules. Detection in affinity biosensors [2] relies on chemical attraction between target analytes and their molecular complements, which serve as biological sensing elements (*probes*). The attraction between these biomolecules (their *affinity* for each other) leads to binding, in which probes capture target analytes. For instance, nucleic acid probes (DNA, RNA, or synthetic oligonucleotides) capture their Watson-Crick complements, antibody probes capture antigens, cell receptor probes capture ligands, and so forth. A transducer then converts the number of complex molecular structures that are formed due to the binding into a signal. Affinity biosensors can be multiplexed, which led to the development of microarrays, that is, arrays of affinity biosensors capable of testing a large number of analytes simultaneously. DNA microarrays [3], in particular, are capable of screening tens or even hundreds of thousands of different gene sequences at the same time, revealing critical information about the functionality of cells, effects of drugs on organisms, and so forth. Microarrays are time- and cost-efficient, and may enable exciting new applications in drug discovery, medicine, defense systems, and environmental monitoring.

Despite their enormous potential, however, microarrays have not fully met the expectations of the research community and industry. Although in principle reliable [4], their performance still leaves something to be desired [5, 6]. Today, the sensitivity, dynamic range, and resolution of DNA microarrays are limited by interference, noise, probe saturation, and other sources of errors in the analyte detection procedure. Several of these limitations stem from the fact that the molecular binding is a stochastic process, which many of the conventional affinity biosensors attempt to characterize based on a single measurement of its equilibrium, that is, by taking one sample from the steady-state distribution of the binding process. On the other hand, *real-time* DNA microarrays are capable of taking multiple temporal samples of a binding process [7–9]. However, analyte estimation therein is typically performed using only the data collected in the equilibrium, and rarely relies on the kinetics [10].

In [11], analyte targets in real-time DNA microarrays are estimated using the temporally sampled kinetics of the binding process. However, the kinetics process there is described using a deterministic model. In this paper, we propose a comprehensive stochastic model of the binding process and state a Markov Chain Monte Carlo (MCMC) algorithm for the estimation of the target analytes. The performance of the proposed algorithm is tested on both synthetic and experimental data.

The paper is organized as follows. In Section 2, we describe the stochastic differential equation modeling the probe-target binding process. In Section 3, parameter estimation in discretely sampled diffusion processes is described, assuming noiseless data acquisition. An MCMC algorithm for the parameter estimation in the realistic noisy scenario is discussed in Section 4. Section 5 shows simulation results, while the experimental verification is provided in Section 6. Section 7 concludes the paper and outlines future work.

## 2. Stochastic Model

where is the probability of release of a captured analyte molecule, and where denotes the disassociation rate (for more details see, e.g., [12]).

and where denotes the Wiener process (detailed derivation is in [12]).

Real-time DNA microarrays collect noisy observations of the temporally sampled diffusion process (4). Ultimately, we would like to use the collected observations to estimate the parameters of the model (including the number of target molecules). A survey of techniques for parameter estimation of discretely observed diffusion processes is given in [14]. These techniques include (i) estimating functions [15]; (ii) indirect inference and the efficient method of moments [16]; (iii) Bayesian analysis and Markov Chain Monte Carlo (MCMC) methods [17–20]; and (iv) analytical and numerical approximation of the likelihood function [21–23]. For Bayesian analysis and the MCMC methods, the SDE is first discretized in sync with the measurements, using time increments equal to the sampling period of the measurements. Additional time points are then introduced between the samples [24], and the corresponding values of the process are treated as missing data points. The MCMC techniques [25] are used to generate the missing data points. We should point out that MCMC techniques may be employed to estimate parameters in fairly general SDE models where the drift and diffusion coefficients are allowed to be nonlinear functions of the diffusion process, or where parameters may enter into these coefficients nonlinearly. This is the case for the SDE model of real-time biosensor arrays (4).
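As a rough illustration of the discretization step, the sketch below simulates a scalar SDE on a grid that is finer than the sampling period, using the Euler-Maruyama scheme. The function name `euler_maruyama_path`, its arguments, and the values in the usage example are illustrative assumptions. In particular, `toy_drift` and `toy_diffusion` are hypothetical stand-ins for a binding-type process and are not the coefficients of model (4).

```python
import math
import random


def euler_maruyama_path(x0, theta, drift, diffusion, dt, n_steps, rng):
    """Simulate one Euler-Maruyama path of dX = drift dt + diffusion dW.

    Returns a list of n_steps + 1 values, starting with x0.
    """
    path = [x0]
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment, N(0, dt)
        x = x + drift(x, theta) * dt + diffusion(x, theta) * dw
        path.append(x)
    return path


# Hypothetical stand-in coefficients (NOT model (4) of the paper):
# theta = (capture_rate, release_rate, n_targets).
def toy_drift(x, theta):
    capture_rate, release_rate, n_targets = theta
    return capture_rate * (n_targets - x) - release_rate * x


def toy_diffusion(x, theta):
    capture_rate, release_rate, n_targets = theta
    rate_sum = capture_rate * (n_targets - x) + release_rate * x
    return math.sqrt(max(rate_sum, 0.0))  # guard against negative round-off


if __name__ == "__main__":
    rng = random.Random(0)
    sampling_period = 1.0  # time between measurements (assumed units)
    m = 10                 # sub-steps per sampling period (assumed)
    path = euler_maruyama_path(0.0, (0.05, 0.01, 1000.0), toy_drift,
                               toy_diffusion, sampling_period / m, 5 * m, rng)
    at_measurement_times = path[::m]  # every m-th point is a sampling instant
    print(len(path), len(at_measurement_times))
```

Only every `m`-th point of the simulated path coincides with a measurement time; the points in between play the role of the missing data that the MCMC procedure has to generate.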

for some positive constant (see, e.g., [26]). For the sake of clarity of presentation, in the next section we first consider the noise-free case. Then, in the following section, we turn our attention to the noisy case.

## 3. Parameter Estimation in the Noise-Free Case

where , and then find by maximizing . The challenge, however, is that , a closed form expression for the transitional density between two consecutive discrete observation points is unavailable for the system in (4). Therefore, the likelihood function is often approximated via various numerical techniques [27, 28]. Here we describe the data augmentation procedure.

, where , and where .

In these equations, we use the fact that the process is Markov to write the joint distribution as a product of transition densities. We generate sample paths of the process between consecutive observation times to approximate the transition density. It remains to construct an efficient importance sampler for drawing the missing samples.

for each

To summarize, in each time interval we perform the following steps.

- (1)
Starting from , employ the Euler-Maruyama technique (10) to generate samples of the process at . These samples are denoted by , .

- (2)
Use to estimate the transition density according to (14).

Finally, is maximized over . For large , , the resulting approaches the true ML estimate of .
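The two steps summarized above can be sketched as follows. This is a minimal sketch that assumes the estimate in (14) has the usual simulated-likelihood form: paths are propagated with Euler-Maruyama steps up to the last sub-interval, and the Gaussian one-step Euler densities of reaching the next observation are averaged. The function names, the argument names (`n_sub` sub-steps per interval, `n_paths` simulated paths), and the Brownian-motion example are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def simulated_transition_density(x_start, x_end, theta, drift, diffusion,
                                 delta, n_sub, n_paths, rng):
    """Monte Carlo estimate of p(x_end | x_start) over a time step delta."""
    h = delta / n_sub  # length of one sub-interval
    total = 0.0
    for _ in range(n_paths):
        x = x_start
        # Step (1): simulate up to the last intermediate grid point.
        for _ in range(n_sub - 1):
            dw = rng.gauss(0.0, math.sqrt(h))
            x = x + drift(x, theta) * h + diffusion(x, theta) * dw
        # Step (2): Gaussian Euler density of the final sub-step to x_end.
        mean = x + drift(x, theta) * h
        var = diffusion(x, theta) ** 2 * h
        total += normal_pdf(x_end, mean, var)
    return total / n_paths


if __name__ == "__main__":
    # Sanity check on standard Brownian motion, whose exact transition
    # density over a unit time step is the N(0, 1) density.
    rng = random.Random(0)
    estimate = simulated_transition_density(
        0.0, 0.0, None, lambda x, th: 0.0, lambda x, th: 1.0,
        1.0, 4, 20000, rng)
    print(estimate, normal_pdf(0.0, 0.0, 1.0))
```

Summing the logarithms of such estimates over all consecutive observation pairs gives an approximate log-likelihood that can be handed to a numerical optimizer. Increasing `n_sub` reduces the discretization bias and increasing `n_paths` reduces the Monte Carlo variance.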

To lower the computational complexity of the approach described in this section, various modifications have been proposed. For instance, alternative importance samplers are employed to accelerate the convergence of the Monte Carlo integration, resulting in significant computational savings (see, e.g., [30] and the references therein). We shall not pursue these alternative importance samplers here. Instead, we switch our attention to the estimation problem in the noisy measurement case.

## 4. An MCMC Algorithm for Parameter Estimation in the Noisy Case

where denotes iid Gaussian noise , and where is introduced for notational convenience. (Note that for the sake of simplicity we set the transduction coefficient in the measurement equation to .) Let denote the set of collected noisy observations, where . Furthermore, we denote and collect the points into . (Note that is a noisy observation of .)

and where , . We rely on the Gibbs sampling technique to draw the missing data conditioned on the current state of the parameters and observations, and draw the parameters conditioned on the simulated missing data and observations. This procedure generates a Markov chain whose stationary distribution is (18). Expressed algorithmically, we perform the following steps.

- (1)
Initialize parameters and latent values. Use linear interpolation between the measured points in to initialize . Set the iteration counter to .

- (2)
In the iteration , draw .

- (3)
Draw via Gaussian random walk update.

- (4)
Set and go to step 2.
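Steps 1 to 4 can be organized as in the skeleton below. It is a sketch of the control flow only: `update_path` and `update_theta` are hypothetical callables standing in for the draws in steps 2 and 3, and `m` (the assumed number of sub-intervals per sampling period) is a name introduced here for illustration.

```python
def init_latent_path(y, m):
    """Step 1: linearly interpolate the observations y onto a finer grid.

    The grid has m sub-intervals between consecutive observations, so the
    result has (len(y) - 1) * m + 1 points.
    """
    x = []
    for i in range(len(y) - 1):
        for k in range(m):
            x.append(y[i] + (y[i + 1] - y[i]) * k / m)
    x.append(y[-1])
    return x


def metropolis_within_gibbs(y, m, theta0, update_path, update_theta,
                            n_iter, rng):
    """Alternate between latent-path and parameter updates."""
    x = init_latent_path(y, m)
    theta = theta0
    draws = []
    for _ in range(n_iter):
        x = update_path(x, theta, y, rng)       # step 2: missing data
        theta = update_theta(theta, x, y, rng)  # step 3: parameters
        draws.append(theta)                     # step 4: next iteration
    return draws, x
```

In practice the first portion of `draws` would be discarded as burn-in before the remaining draws are summarized, for example by their sample mean.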

Finding the analytical expressions of the distributions in steps 2 and 3 appears infeasible. Hence, we employ the Metropolis-Hastings (M-H) algorithm to sample from them numerically. In step 2, we generate a single component of the missing data at a time (the so-called single-site update), where there are four different cases depending on the value of the time index. Case 1 deals with drawing the missing data for which there are no corresponding noisy observations (i.e., the time index does not coincide with a measurement time). On the other hand, Cases 2–4 deal with drawing the missing data for which we do acquire noisy measurements. Among these, Cases 3 and 4 deal with the missing data at the start and at the end of the binding process, respectively (i.e., the boundary points). Case 2 deals with drawing the remaining missing data (i.e., the interior points that coincide with a measurement time).
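The four cases can be told apart with a small helper such as the one below. The indexing convention is an assumption made for this sketch: grid indices run from 0 to `n_intervals * m`, and a noisy measurement is available at every `m`-th index.

```python
def site_case(j, m, n_intervals):
    """Return which of Cases 1-4 applies to grid index j.

    Assumes grid indices 0..n_intervals*m with a measurement at every
    m-th index (an illustrative convention, not taken from the paper).
    """
    last = n_intervals * m
    if not 0 <= j <= last:
        raise ValueError("grid index out of range")
    if j % m != 0:
        return 1  # no measurement at this grid point
    if j == 0:
        return 3  # start of the binding process
    if j == last:
        return 4  # end of the binding process
    return 2      # interior grid point with a measurement
```

A single sweep of step 2 would then loop over all grid indices and dispatch each one to the update that matches its case.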

Case 1.

Direct sampling from this distribution is not feasible. Therefore, we need to employ the M-H algorithm.

However, we need to consider a more general case where drift and diffusion coefficients are functions of parameters and the diffusion process (clearly, this is the case for our model).

Here, is the value at the iteration and is the value obtained at iteration of the Gibbs Sampler.
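A single-site M-H update for Case 1 might look like the sketch below. It assumes that the target density of an unobserved grid point is proportional to the product of the two Euler (Gaussian) transition densities that involve it, and it uses a Gaussian proposal centered at the midpoint of the two neighbors. Both the proposal and all names are assumptions for illustration and are not necessarily the choices made by the authors.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_euler_transition(x_next, x_prev, theta, drift, diffusion, h):
    """Log of the one-step Euler approximation to p(x_next | x_prev)."""
    mean = x_prev + drift(x_prev, theta) * h
    var = diffusion(x_prev, theta) ** 2 * h
    return log_normal_pdf(x_next, mean, var)


def mh_update_interior(x_prev, x_curr, x_next, theta, drift, diffusion, h, rng):
    """One M-H update of an unobserved point between x_prev and x_next.

    Returns (new_value, accepted).
    """
    # Assumed proposal: Gaussian around the midpoint of the neighbors.
    prop_mean = 0.5 * (x_prev + x_next)
    prop_var = 0.5 * diffusion(x_prev, theta) ** 2 * h
    x_prop = rng.gauss(prop_mean, math.sqrt(prop_var))

    def log_target(x):
        return (log_euler_transition(x, x_prev, theta, drift, diffusion, h)
                + log_euler_transition(x_next, x, theta, drift, diffusion, h))

    # Hastings ratio for an independence proposal, in the log domain.
    log_alpha = (log_target(x_prop) - log_target(x_curr)
                 + log_normal_pdf(x_curr, prop_mean, prop_var)
                 - log_normal_pdf(x_prop, prop_mean, prop_var))
    if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
        return x_prop, True
    return x_curr, False
```

When the drift is zero and the diffusion coefficient is constant, this proposal coincides with the target and nearly every draw is accepted. With state-dependent coefficients the acceptance step corrects for the mismatch between the two.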

Case 2 ( is an integer multiple of , , ).

Case 3 ( ).

Case 4 ( ).

In this case, we can directly sample from the above density, so there is no need for the M-H algorithm.

## 5. Simulation Results

## 6. Experimental Verification

To verify the proposed approach in experiments, we used the real-time microarray data reported in [11]. In those experiments, cDNA targets were generated from The RNA Spikes, a commercially available set of 8 purified *Escherichia coli* RNA transcripts purchased from Ambion Inc. Lengths of the RNA sequences in the set are (750, 752, 1000, 1000, 1034, 1250, 1475, 2000), respectively. The RNA sequences were reverse transcribed to obtain the cDNA targets, which were then labeled with Cy5 dyes. Eight probes (25-mer oligonucleotides) were designed and printed on slides, where each probe was repeated in 6 different spots; hence, the printed slides had 48 spots. We focus on two experiments, one where the concentration of the targets was 80 ng/50 μL, and the other where the concentration of the targets was 16 ng/50 μL.

where and .

Moreover, since the noise variance is generally unknown, we add it to the vector of the unknown parameters. This requires a slight modification of step 3 of the MCMC algorithm described previously.
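One way to carry the noise variance along in the Gaussian random-walk update is sketched below. It is an illustration under stated assumptions rather than the authors' exact modification: the measurement model is taken to be the latent value plus iid Gaussian noise, the noise variance is stored as the last entry of the parameter list, the prior is flat on positive variances, and proposals with a non-positive variance are rejected by giving them zero posterior mass. All function names are hypothetical.

```python
import math
import random


def gaussian_measurement_loglik(y, x_obs, noise_var):
    """Log-likelihood of y[i] = x_obs[i] + N(0, noise_var) noise."""
    n = len(y)
    sq_err = sum((yi - xi) ** 2 for yi, xi in zip(y, x_obs))
    return (-0.5 * n * math.log(2.0 * math.pi * noise_var)
            - 0.5 * sq_err / noise_var)


def random_walk_mh_step(params, log_post, step_sizes, rng):
    """One Gaussian random-walk Metropolis step on a list of parameters."""
    proposal = [p + s * rng.gauss(0.0, 1.0)
                for p, s in zip(params, step_sizes)]
    log_alpha = log_post(proposal) - log_post(params)
    if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
        return proposal
    return params


def make_log_post(y, x_obs):
    """Hypothetical log-posterior in which only the noise variance matters."""
    def log_post(params):
        noise_var = params[-1]  # last entry: noise variance (assumed layout)
        if noise_var <= 0.0:
            return -math.inf    # outside the support, always rejected
        return gaussian_measurement_loglik(y, x_obs, noise_var)
    return log_post
```

In a full sampler, `log_post` would also include the Euler transition terms of the latent path and any prior on the model parameters, so that the model parameters and the noise variance are updated jointly or in turn.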

## 7. Summary and Conclusion

In this paper, we considered the problem of estimating the number of target molecules in stochastically modeled biomolecular sensors. We posed it as a parameter estimation problem in systems modeled by stochastic differential equations, where the noise-perturbed data is acquired at discrete points in time. Since the problem is analytically intractable, we employed MCMC techniques to obtain a numerical solution. In particular, we relied on the Gibbs Sampler to alternate between drawing missing data conditioned on parameters and observations, and drawing parameters conditioned on the simulated missing data and the observations. We used the Metropolis-Hastings technique within the Gibbs Sampler to simulate analytically intractable densities. Simulation results indicate that the proposed algorithm significantly outperforms the existing least-mean-squares approach, and that the algorithm is robust with respect to the measurement noise. Moreover, we applied the algorithm to experimental data to verify the validity of the estimation algorithm in a realistic scenario.

There are several possible extensions of the current work. For instance, the MCMC algorithm described in this paper can also be applied to multivariate diffusion processes. Such processes arise in the context of gene regulatory networks as well as in real-time biosensor arrays affected by cross-hybridization. For this scenario, one may extend the algorithm so that it handles unobserved parts of a multivariate diffusion process. On another note, a variation of the MCMC algorithm performs (random) block updating (see, e.g., [19, 20]). It is worth pursuing this modification in the context of parameter estimation in real-time biosensors.

## Authors’ Affiliations

## References

- [1] Cooper J, Cass T (Eds): *Biosensors*. 2nd edition. Oxford University Press, Oxford, UK; 2004.
- [2] Rogers KR, Mulchandani A: *Affinity Biosensors*. Humana Press, Totowa, NJ, USA; 1998.
- [3] Schena M: *Microarray Analysis*. John Wiley & Sons, New York, NY, USA; 2003.
- [4] Shi L, Reid LH, Jones WD, *et al*.: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. *Nature Biotechnology* 2006, 24(9):1151-1161. doi:10.1038/nbt1239
- [5] Marshall E: Getting the noise out of gene arrays. *Science* 2004, 306(5696):630-631. doi:10.1126/science.306.5696.630
- [6] Draghici S, Khatri P, Eklund AC, Szallasi Z: Reliability and reproducibility issues in DNA microarray measurements. *Trends in Genetics* 2006, 22(2):101-109. doi:10.1016/j.tig.2005.12.005
- [7] Stimpson DI, Hoijer JV, Hsieh W, Jou C, Gordon J, Theriault T, Gamble R, Baldeschwieler JD: Real-time detection of DNA hybridization and melting on oligonucleotide arrays by using optical wave guides. *Proceedings of the National Academy of Sciences of the United States of America* 1995, 92(14):6379-6383. doi:10.1073/pnas.92.14.6379
- [8] Bishop J, Chagovetz AM, Blair S: Kinetics of multiplex hybridization: mechanisms and implications. *Biophysical Journal* 2008, 94(5):1726-1734. doi:10.1529/biophysj.107.121459
- [9] Henry MR, Stevens PW, Sun J, Kelso DM: Real-time measurements of DNA hybridization on microparticles with fluorescence resonance energy transfer. *Analytical Biochemistry* 1999, 276(2):204-214. doi:10.1006/abio.1999.4344
- [10] Mirsky VM: Affinity sensors in non-equilibrium conditions: highly selective chemosensing by means of low selective chemosensors. *Sensors* 2001, 1(1):13-17. doi:10.3390/s10100013
- [11] Vikalo H, Hassibi B, Hassibi A: Modeling and estimation for real-time microarrays. *IEEE Journal on Selected Topics in Signal Processing* 2008, 2(3):286-296.
- [12] Das S, Vikalo H, Hassibi A: On scaling laws of biosensors: a stochastic approach. *Journal of Applied Physics* 2009, 105(10).
- [13] Allen E: *Modeling with Itô Stochastic Differential Equations*. Springer, New York, NY, USA; 2007.
- [14] Sørensen H: Parametric inference for diffusion processes observed at discrete points in time: a survey. *International Statistical Review* 2004, 72(3):337-354.
- [15] Bibby BM: Estimating functions for discretely sampled diffusion type models. In *Handbook of Financial Econometrics*. North-Holland, Amsterdam, The Netherlands; 2002.
- [16] Gallant AR, Long JR: Estimating stochastic differential equations efficiently by minimum chi-squared. *Biometrika* 1997, 84(1):125-141. doi:10.1093/biomet/84.1.125
- [17] Eraker B: MCMC analysis of diffusion models with application to finance. *Journal of Business and Economic Statistics* 2001, 19(2):177-191. doi:10.1198/073500101316970403
- [18] Roberts GO, Stramer O: On inference for partially observed nonlinear diffusion models using the Metropolis-Hastings algorithm. *Biometrika* 2001, 88(3):603-621. doi:10.1093/biomet/88.3.603
- [19] Golightly A: *Bayesian inference for nonlinear multivariate diffusion processes*, Ph.D. thesis. Newcastle University, Newcastle, UK; 2006.
- [20] Elerian O, Chib S, Shephard N: Likelihood inference for discretely observed nonlinear diffusions. *Econometrica* 2001, 69(4):959-993. doi:10.1111/1468-0262.00226
- [21] Aït-Sahalia Y: Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. *Econometrica* 2002, 70(1):223-262. doi:10.1111/1468-0262.00274
- [22] Lo AW: Maximum likelihood estimation of generalized Ito processes with discretely sampled data. *Econometric Theory* 1988, 4:231-247. doi:10.1017/S0266466600012044
- [23] Poulsen R: *Approximate maximum likelihood estimation of discretely observed diffusion processes*. Centre for Analytical Finance, University of Aarhus; 1999.
- [24] Pedersen AR: A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations. *Scandinavian Journal of Statistics* 1995, 22:55-71.
- [25] Spall JC: Estimation via Markov chain Monte Carlo. *IEEE Control Systems Magazine* 2003, 23(2):34-45. doi:10.1109/MCS.2003.1188770
- [26] Oksendal B: *Stochastic Differential Equations: An Introduction with Applications*. Springer, Berlin, Germany; 2003.
- [27] Bishwal JPN: *Parameter Estimation in Stochastic Differential Equations*. Springer, New York, NY, USA; 2007.
- [28] Kloeden P, Platen E: *Numerical Solution of Stochastic Differential Equations*. Springer, New York, NY, USA; 1992.
- [29] Brandt MW, Santa-Clara P: Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets. *Journal of Financial Economics* 2002, 63(2):161-210. doi:10.1016/S0304-405X(01)00093-9
- [30] Durham GB, Gallant AR: Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. *Journal of Business and Economic Statistics* 2002, 20(3):297-316. doi:10.1198/073500102288618397

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.