Acoustic analysis attracts a growing number of researchers, since it is an important tool for studying applications in which the voice signal is present. Systems for voice segmentation, voice coding, identification and classification of pathologies, and emulation of voice disturbances, among others, can be developed or improved based on acoustic analysis.
Particularly in the case of pathology emulation, research reported in the literature seeks methods that can discriminate between pathological and healthy voices. In this case, acoustic analysis can be combined with techniques for the direct observation of the vocal cords, with the objective of obtaining indicators that can identify disturbances in the voice.
Several pathologies of the voice can be detected by observing the vocal cords, which are one of the main tissues involved in voice production. However, to make the identification of vocal cord disturbances viable, the acoustic analysis technique must represent the phonation process as faithfully as possible.
In this context, the mathematical modelling of the behavior of the vocal cords during phonation, as described in this article, is a powerful acoustic analysis technique. With this method, it is possible to generate a voice signal in the time domain and to estimate the voice power spectral density by means of mathematical expressions whose parameters can be adjusted to emulate healthy and pathological voices.
In the voice production process, the sub-glottal pressure forces the vocal cords apart, and the air travels through the glottis at a higher speed. Due to the Bernoulli effect, this produces a reduction in the pressure against the internal sides of each vocal fold, and the folds come together again. The opening and closing cycle of the glottis repeats, generating a train of pulses which feeds the vocal tract. This whole process is only possible because the vocal cords are elastic [4].
Particularly in the case of the vowel phonemes, this procedure causes the vibration of the vocal cords. On average, the vocal cords vibrate with period To = 1/Fo s or, in other words, at a rate given by the fundamental frequency Fo [9].
However, the vibration frequency of the vocal cords changes constantly as different intonation patterns are produced in the sentences. Thus, the frequency F produced by a given speaker varies throughout the speech. Over the duration of an utterance, the fundamental frequency is attained only for brief moments, and frequencies above and below this average are observed the rest of the time.
The proposed voice production model is composed of five parts, as presented in Fig. 1: pulse generator, glottal pulse, gain, vocal tract, and radiation.
Unlike other works presented in the literature, the new voice formation model is based on the biophysics of phonation and models the glottal flux taking into consideration the cyclostationary vibratory movement of the vocal cords.
Specifically, the model covers the variation of the fundamental frequency during speech and the gain parameter, which is related to the pressure of the air coming from the diaphragm and is applied to the glottal flux. The purpose is to obtain a voice generation model whose parameters can be related to biomedical data and adjusted to emulate voice pathologies.
In voice generation, the vibration frequency is determined by the elasticity, mass, and especially the longitudinal tension applied to the vocal cords. To a lesser extent, it is affected by the vertical tension produced by raising or lowering the larynx, as well as by variations of the sub-glottal pressure.
In the proposed model, the source of excitation is controlled by an electric signal (control signal) M(t) that can be measured at the vocal cords, whose amplitude controls the release of glottal pulses. This signal commands the vibration mechanism of the vocal cords, and mathematically, the relation between the variation of the vibration frequency and the tension can be expressed by
$$ \Delta \omega=\omega_{o}\alpha_{1} T(t), $$
(2)
in which Δω is the frequency variation of the impulses in the impulse sequence, T(t) is the mechanical tension at the vocal cords, α1 is a sensitivity constant of the process, and ωo = 2πFo.
The mechanical tension T(t) can be written as directly proportional to the signal M(t), so that
$$ T(t)=\frac{1}{\alpha_{2}}M(t), $$
(3)
in which M(t) is an electric signal that can be measured at the vocal cords and α2 is a sensitivity constant of the process.
Thus
$$ \Delta \omega = \omega_{o}\frac{\alpha_{1}}{\alpha_{2}} M(t). $$
(4)
Thus it is possible to define
$$ \beta=\frac{\alpha_{1}}{\alpha_{2}}, $$
(5)
in which β represents the ratio between the sensitivity constants of the process.
The analysis of voice signal production, using the source-filter model as a prototype, consists mainly of defining a mathematical model for the source excitation. Each step of the new voice production model is described in the following.
Cyclostationary impulse generation
In the process of producing human voiced sounds, a glottal pulse E(t), originating in the lungs, is forced through the opening between the vocal cords, the glottis. When the vocal cords are under a larger tension, they vibrate faster, contributing to the high-pitched speech sounds. When the cords are under a smaller tension, they vibrate more slowly, contributing to the low-pitched sounds.
This process of voice generation can be modeled as the passage of an impulse train C(t) through a linear time-invariant system with impulse response equal to the glottal pulse E(t).
The impulse train C(t) can be interpreted as the output of a cyclostationary impulse generator, since it consists of a sequence of impulses in time whose positions are controlled by a cyclostationary signal M(t), which works as a modulating signal.
When the random signal M(t) is not present, it is possible to consider that the vocal cords are under an average tension and vibrate at the fundamental frequency Fo. Thus, the signal C(t) has impulses equally spaced by To and can be written in terms of a trigonometric Fourier series,
$$ C(t)=a_{0}+\sum_{k=1}^{\infty}a_{k}\cos(k\omega_{o}t)+\sum_{l=1}^{\infty}b_{l}~\sin\left(l\omega_{o}t\right), $$
(6)
in which
$$ a_{0}=\frac{1}{T_{o}}\int_{-\frac{T_{o}}{2}}^{\frac{T_{o}}{2}}\delta(t)dt=\frac{1}{T_{o}}, $$
(7)
$$ a_{k}=\frac{2}{T_{o}}\int_{-\frac{T_{o}}{2}}^{\frac{T_{o}}{2}}\delta(t)\cos(k\omega_{o}t)dt=\frac{2}{T_{o}}, $$
(8)
and
$$ b_{l}=\frac{2}{T_{o}}\int_{-\frac{T_{o}}{2}}^{\frac{T_{o}}{2}}\delta(t)~\sin\left(l\omega_{o}t\right)dt=0, $$
(9)
so that
$$ C(t)=\frac{1}{T_{o}}+\frac{2}{T_{o}}\sum_{k=1}^{\infty}\cos\left(k\omega_{o}t\right). $$
(10)
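As a quick numerical illustration of (10), the sketch below evaluates a truncated version of the series. The fundamental frequency of 100 Hz, the number of harmonics, and the function name are illustrative assumptions, not values from the text. As more harmonics are kept, the partial sum concentrates around multiples of To, which is the expected behavior for an impulse train.

```python
import numpy as np

def impulse_train_partial_sum(t, f0, n_harmonics):
    """Partial sum of Eq. (10): 1/To + (2/To) * sum_{k=1..K} cos(k*wo*t)."""
    t = np.asarray(t, dtype=float)
    T0 = 1.0 / f0
    w0 = 2.0 * np.pi * f0
    k = np.arange(1, n_harmonics + 1)[:, None]   # column of harmonic indices
    return 1.0 / T0 + (2.0 / T0) * np.cos(k * w0 * t[None, :]).sum(axis=0)

f0 = 100.0                                # illustrative fundamental frequency (Hz)
K = 50                                    # number of harmonics kept (assumption)
t = np.linspace(-0.015, 0.015, 3001)      # three periods around t = 0
c = impulse_train_partial_sum(t, f0, K)   # sharp peaks near t = 0 and t = +/- To
```

At t = 0 every cosine equals one, so the partial sum takes the value (1 + 2K)/To, which grows without bound as K increases, while halfway between two impulses the harmonics cancel in pairs.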
When the signal M(t) is present, the intervals between the impulses of the cyclostationary signal C(t) are controlled by the integral of M(t), so that C(t) can be written as
$$ C(t)=\frac{1}{T_{o}}+\frac{2}{T_{o}}\sum_{k=1}^{\infty} \cos\left(k\left(\omega_{o}t+\omega_{o}\beta\int_{-\infty}^{t}M(\tau)d\tau+\phi_{o}\right)\right), $$
(11)
in which
1. The average frequency ωo corresponds to the period To of the non-modulated train of impulses.
2. The phase ϕo is directly proportional to the initial position δo of the impulses, ϕo = ωoδo, and is uniformly distributed in an interval of length 2π.
3. The signal M(t) is an electric signal measured at the vocal cords and has a dimension of mV.
4. The parameter β can be seen as a sensibility constant for the modulation process of the glottis and has a dimension of V⁻¹.
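One way to read (11) in discrete time is that an impulse occurs whenever the total phase ωo t + ωo β ∫M(τ)dτ + ϕo crosses a multiple of 2π. The following is a minimal sketch under that reading. It assumes that M(t) is sampled at a rate fs, that the integral starts at the first sample, that β and M(t) are expressed in reciprocal units so that βM(t) is dimensionless, and that βM(t) ≥ 0 so the phase never decreases. The function name and all numeric values are hypothetical, and the sketch returns only the impulse positions, not their weights.

```python
import numpy as np

def impulse_times(m, fs, f0, beta, phi0=0.0):
    """Times at which the phase of Eq. (11) crosses a multiple of 2*pi."""
    m = np.asarray(m, dtype=float)
    w0 = 2.0 * np.pi * f0
    t = np.arange(m.size) / fs
    # Running integral of M(t), approximated by a cumulative sum (rectangle rule).
    integral = np.cumsum(m) / fs
    theta = w0 * t + w0 * beta * integral + phi0
    cycles = np.floor(theta / (2.0 * np.pi))
    # An impulse is placed at the first sample of every new cycle.
    crossings = np.nonzero(np.diff(cycles) > 0)[0] + 1
    return t[crossings]

fs = 16000                      # sampling rate in Hz (illustrative)
m = np.full(fs, 0.25)           # constant control signal, one second long
times = impulse_times(m, fs, f0=100.0, beta=1.0)
```

With a constant control signal the impulses stay equally spaced, but at the higher rate Fo(1 + βM), in agreement with (12). A time-varying M(t) moves the impulses closer together or farther apart from one cycle to the next.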
For this model, the variation of the vibration frequency of the vocal cords is also directly proportional to the signal M(t), that is
$$ \Delta \omega=\omega_{o}\beta M(t), $$
(12)
and this variation occurs in the interval
$$ 0\leq \Delta\omega\leq\omega_{o}\beta M_{{max}}, $$
(13)
in which Mmax is the maximum amplitude of the random signal M(t). Thus, the frequency deviation is such that
$$ 0\leq\beta\omega_{o}M(t)\leq\omega_{m}, $$
(14)
in which ωm is the maximum frequency of the vocal cord oscillation and the amplitude of M(t) varies in the interval
$$ 0\leq M(t)\leq \frac{\omega_{m}}{\beta \omega_{o}}. $$
(15)
Spectral analysis of the signal C(t)
Before starting the spectral analysis of the random signal C(t), it is convenient to define
$$ \phi(t)=\omega_{o}\beta\int_{-\infty}^{t}M(\tau)d\tau $$
(16)
and rewrite C(t) as
$$ C(t)=\frac{1}{T_{o}}+\frac{2}{T_{o}}\sum_{k=1}^{\infty} \cos\left(k\left(\omega_{o}t+\phi(t)+\phi_{o}\right)\right). $$
(17)
Since ϕo is a uniformly distributed random variable in an interval of length 2π, then the expected value of C(t) is a constant \(\frac {1}{T_{o}}\). Thus, if it is possible to write the autocorrelation of C(t) in terms of only the difference between the instants of observation t and t+τ then it is possible to affirm that C(t) is a wide-sense stationary process (WSS). The autocorrelation of C(t) can be written as
$$ \begin{aligned} R_{C}(\tau)&=E \left[ \left(\frac{1}{T_{o}}+\frac{2}{T_{o}}\sum_{k=1}^{\infty}\cos\left(k(\omega_{o}t+\phi(t)+\phi_{o})\right)\right)\right.\\ &\quad\left. \cdot\left(\frac{1}{T_{o}}+\frac{2}{T_{o}}\sum_{l=1}^{\infty} \cos\left(l\left(\omega_{o}(t+\tau)+\phi(t+\tau)+\phi_{o}\right)\right)\right)\right]. \end{aligned} $$
(18)
After expanding the product of the terms, applying the expected value, and using the fact that the expected value of the cosine of a random variable uniformly distributed in an interval of length 2π is zero, one can write
$$ R_{C}(\tau)=\frac{1}{T_{o}^{2}}+\frac{2}{T_{o}^{2}}\sum_{k=1}^{\infty} E\left[\cos\left(k\left(\omega_{o}\tau+\left(\phi(t+\tau)-\phi(t)\right)\right)\right)\right]. $$
(19)
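The step from (18) to (19) can be made explicit. The sketch below uses the shorthand \(A_{k}=k(\omega_{o}t+\phi(t)+\phi_{o})\) and \(B_{l}=l(\omega_{o}(t+\tau)+\phi(t+\tau)+\phi_{o})\), which is introduced here only for illustration, and assumes that ϕo is independent of M(t):

$$ \begin{aligned} E\left[\cos A_{k}\right]&=E\left[\cos B_{l}\right]=0,\\ E\left[\cos A_{k}\cos B_{l}\right]&=\frac{1}{2}E\left[\cos\left(B_{l}-A_{k}\right)\right]+\frac{1}{2}E\left[\cos\left(B_{l}+A_{k}\right)\right]. \end{aligned} $$

The argument \(B_{l}+A_{k}\) contains the term (l+k)ϕo and the argument \(B_{l}-A_{k}\) contains (l−k)ϕo, so averaging over ϕo removes every term except the difference term with l = k. Each surviving term then contributes \(\frac{4}{T_{o}^{2}}\cdot\frac{1}{2}E\left[\cos\left(k\left(\omega_{o}\tau+\phi(t+\tau)-\phi(t)\right)\right)\right]\), which gives the factor \(\frac{2}{T_{o}^{2}}\) in (19).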
According to [10], the random process ϕ(t+τ) can be approximated by a mean square error linear estimator, so
$$ \phi(t+\tau)\approx \phi(t)+\tau\phi^{\prime}(t). $$
(20)
Thus, RC(τ) can be rewritten as
$$ R_{C}(\tau)=\frac{1}{T_{o}^{2}}+\frac{2}{T_{o}^{2}}\sum_{k=1}^{\infty} E\left[\cos\left(k\left(\omega_{o}\tau+\tau\phi^{\prime}(t)\right)\right)\right]. $$
(21)
Applying Euler's formula to the cosine function, RC(τ) can be rewritten as
$$ R_{C}(\tau)=\frac{1}{T_{o}^{2}}+\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty} E\left[e^{jk(\omega_{o}\tau+\tau\phi^{\prime}(t))}\right] +\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty} E\left[e^{-jk(\omega_{o}\tau+\tau\phi^{\prime}(t))}\right]. $$
(22)
At this point of the development, it is important to remember that ϕ(t) is an instantaneous phase defined by Formula (16), so its derivative is a random instantaneous frequency, represented by ω(t). So RC(τ) can be written as
$$ R_{C}(\tau)=\frac{1}{T_{o}^{2}}+\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty}e^{jk\omega_{o}\tau}E\left[e^{jk\tau\omega(t)}\right] +\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty}e^{-jk\omega_{o}\tau}E\left[e^{-jk\tau\omega(t)}\right]. $$
(23)
The two expected values in this expression correspond, respectively, to the characteristic function of ω(t) and its complex conjugate, evaluated at kτ. If this function is denoted φω(t)(v), then RC(τ) can be written as
$$ R_{C}(\tau)=\frac{1}{T_{o}^{2}}+\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty}e^{jk\omega_{o}\tau}\varphi_{\omega(t)}(k\tau) +\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty}e^{-jk\omega_{o}\tau}\varphi_{\omega(t)}^{\ast}(k\tau). $$
(24)
From this expression, and the approximation ϕ(t+τ)≈ϕ(t)+τϕ′(t), it is possible to write RC(τ) as a function of τ and say that C(t) is approximately wide-sense stationary.
As expressed in (25), the power spectral density of C(t) can be obtained by calculating the Fourier transform of RC(τ),
$$ S_{C}(v) ~=~ \int_{-\infty}^{\infty} ~R_{C}(\tau)~e^{-jv\tau} d\tau. $$
(25)
Considering the definitions of characteristic function and continuous time Fourier transform, the PSD of C(t) can be written as
$$ \begin{aligned} S_{C}(v)&=\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty}\int_{-\infty}^{\infty}f_{\Omega}(\omega) \int_{-\infty}^{\infty}e^{-j\left(v-k\omega_{o}-k\omega\right)\tau}d\tau d\omega\\ &\quad+\frac{1}{T_{o}^{2}}\sum_{k=1}^{\infty}\int_{-\infty}^{\infty}f_{\Omega}(\omega) \int_{-\infty}^{\infty}e^{-j\left(v+k\omega_{o}+k\omega\right)\tau}d\tau d\omega\\ &\quad+\frac{2\pi}{T_{o}^{2}}\delta(v). \end{aligned} $$
(26)
Using the following identity from Dirac delta function theory,
$$ \delta\left(t-t_{o}\right)=\frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-j\omega\left(t-t_{o}\right)}d\omega, $$
(27)
the PSD SC(v) can be rewritten as
$$ \begin{aligned} S_{C}(v)&=\frac{2\pi}{T_{o}^{2}}\sum_{k=1}^{\infty}\int_{-\infty}^{\infty} f_{\Omega}(\omega)\delta\left(v-k\omega_{o}-k\omega\right)d\omega\\ &\quad+\frac{2\pi}{T_{o}^{2}}\sum_{k=-\infty}^{-1}\int_{-\infty}^{\infty} f_{\Omega}(\omega)\delta\left(v-k\omega_{o}-k\omega\right)d\omega\\ &\quad+\frac{2\pi}{T_{o}^{2}}\delta(v). \end{aligned} $$
(28)
Applying the sifting and scaling properties of the Dirac delta function, the expression for SC(v) can be rewritten as
$$ S_{C}(v)=\frac{2\pi}{T_{o}^{2}}\sum_{k=1}^{\infty}\frac{1}{|k|}f_{\Omega}\left(\frac{v}{k}-\omega_{o}\right) +\frac{2\pi}{T_{o}^{2}}\sum_{k=-\infty}^{-1}\frac{1}{|k|}f_{\Omega}\left(\frac{v}{k}-\omega_{o}\right) +\frac{2\pi}{T_{o}^{2}}\delta(v), $$
(29)
which can be further rewritten as
$$ S_{C}(\omega)=\frac{2\pi}{T_{o}^{2}}\delta(\omega)+\frac{2\pi}{T_{o}^{2}} \sum_{\substack{k=-\infty\\k\neq 0}}^{\infty}\frac{1}{|k|}f_{\Omega}\left(\frac{\omega}{k}-\omega_{o}\right). $$
(30)
Using the fact that
$$ \phi(t)=\omega_{o}\beta\int_{-\infty}^{t}M(\tau)d\tau $$
(31)
and that \(\Omega (t)=\frac {d}{dt}\phi (t)\) is given by
$$ \Omega(t)=\omega_{o}\beta M(t) $$
(32)
then
$$ f_{\Omega(t)}(\omega)=\frac{1}{\omega_{o}\beta}f_{M(t)}\left(\frac{\omega}{\omega_{o}\beta}\right). $$
(33)
Substituting this result in (30), SC(ω) can be rewritten as
$$ S_{C}(\omega)=\frac{2\pi}{T_{o}^{2}}\delta(\omega)+\frac{2\pi}{\beta\omega_{o}T_{o}^{2}} \sum_{\substack{k=-\infty\\k\neq 0}}^{\infty}\frac{1}{|k|}f_{M(t)}\left(\frac{\omega}{k\beta\omega_{o}}-\frac{1}{\beta}\right). $$
(34)
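Expression (34) can be evaluated numerically once a PDF for M(t) is chosen. The sketch below is one possible way to do so, not a definitive implementation. The harmonic sum is truncated at a finite order, the Dirac term at ω = 0 is left out because it cannot be represented on a sampled grid, and the function names, the Rayleigh scale, and the frequency grid are illustrative assumptions.

```python
import numpy as np

def impulse_train_psd(omega, f0, beta, pdf_m, n_harmonics=20):
    """Continuous part of Eq. (34), truncated to |k| <= n_harmonics.

    pdf_m must accept an array and return 0 outside the support of M(t).
    """
    omega = np.asarray(omega, dtype=float)
    w0 = 2.0 * np.pi * f0
    T0 = 1.0 / f0
    total = np.zeros_like(omega)
    for k in range(1, n_harmonics + 1):
        for signed_k in (k, -k):
            # Argument of f_M in Eq. (34) for this harmonic.
            arg = omega / (signed_k * beta * w0) - 1.0 / beta
            total += pdf_m(arg) / k
    return 2.0 * np.pi / (beta * w0 * T0**2) * total

def rayleigh_pdf(x, sigma=0.05):
    """Rayleigh PDF of Eq. (36); sigma = 0.05 is an arbitrary example value."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0, x / sigma**2 * np.exp(-x**2 / (2.0 * sigma**2)), 0.0)

omega = 2.0 * np.pi * np.linspace(50.0, 1000.0, 2000)   # rad/s
psd = impulse_train_psd(omega, f0=100.0, beta=1.0, pdf_m=rayleigh_pdf)
```

Each harmonic k contributes a copy of the PDF of M(t) that starts at kωo, is stretched by the factor kβωo, and is scaled by 1/|k|, so higher harmonics are progressively wider and lower.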
Probability distribution of M(t)
For M(t), a continuous-time WSS random process is proposed, with probability distribution equal to the distribution of the zero crossings of the voice signal.
This proposal is based on the voice production model described by a linear time-invariant filter, in which no new frequencies are generated when the glottal flux traverses the vocal tract. In this case, the variation of the tension and of the vocal cord oscillation frequency is directly related to the zero crossings observed in the voice waveform.
The fundamental frequency is an average frequency reached by each speaker. However, during speech, the rate of vibration of the vocal cords can be larger or smaller than the fundamental frequency. In light of this, two probability distributions are proposed to model the amplitude variation of the control signal M(t): unilateral gamma and Rayleigh.
1. Unilateral gamma distribution
The unilateral gamma distribution can be characterized by the PDF (probability density function)
$$ f_{X}(x)~=~ \frac{1}{\Gamma(k_{x}) \theta^{k_{x}}} x^{k_{x}-1} e^{-\frac{x}{\theta}} u(x), $$
(35)
in which kx and θ represent, respectively, the shape and scale parameters, and u(x) is the unit step function.
2. Rayleigh distribution
The Rayleigh distribution can be characterized by the PDF
$$ f_{X}(x)~=~ \frac{x}{\sigma^{2}} e^{-\frac{x^{2}}{(2\sigma^{2})}}~ u(x)~ \text{for} ~ x~\geq~0, $$
(36)
in which σ is the scale parameter for the Rayleigh PDF.
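The two candidate densities in (35) and (36) are straightforward to evaluate. The sketch below is a plain NumPy version in which the unit step is implemented by returning zero for non-positive arguments. The function names and parameter values are illustrative assumptions. For simplicity the gamma PDF is also set to zero at exactly x = 0, which differs from (35) only on a set of measure zero.

```python
import math
import numpy as np

def gamma_pdf(x, k_x, theta):
    """Unilateral gamma PDF of Eq. (35), with shape k_x and scale theta."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x > 0.0, x, 1.0)      # placeholder avoids 0**negative for x <= 0
    val = safe**(k_x - 1.0) * np.exp(-safe / theta) / (math.gamma(k_x) * theta**k_x)
    return np.where(x > 0.0, val, 0.0)

def rayleigh_pdf(x, sigma):
    """Rayleigh PDF of Eq. (36), with scale parameter sigma."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0, x / sigma**2 * np.exp(-x**2 / (2.0 * sigma**2)), 0.0)

x = np.linspace(0.0, 10.0, 1001)
g = gamma_pdf(x, k_x=2.0, theta=1.0)      # example parameters, not fitted values
r = rayleigh_pdf(x, sigma=1.0)
```

Both densities are supported on non-negative values only, which matches the interval 0 ≤ M(t) ≤ ωm/(βωo) assumed for the control signal in (15).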
4.1 Glottal pulse model
In the literature it is possible to find several mathematical models that represent the glottal pulse analytically. Although the models differ in the number of parameters, they share several characteristics: the glottal pulse is always positive or null, is almost periodic, and is a continuous function of time.
Besides that, the functions which represent the glottal flux are differentiable in time, except at the opening and closing instants of the glottis. The models of Rosenberg [11], Fant [12], Liljencrants-Fant (LF) [13], and Klatt [14] are some of the glottal flux models found in the literature.
The LF model represents the derivative of the glottal flux and is divided into two segments. The first comprises the opening process of the vocal cords. This segment starts at the instant to, when the vocal cords are closed, and extends to the instant te, when the glottis is returning to its initial state after opening and the derivative assumes its maximum negative value, −Ee [15]. Mathematically, this segment can be written as
$$ \begin{aligned} E(t)~=~E_{o} e^{\alpha t} ~\sin(\omega_{g} t), & ~ t_{o}~ \leq~t~\leq t_{e},\\ \end{aligned} $$
(37)
in which ωg is the angular frequency of the sinusoid, α determines the rate of increase of the amplitude, and Eo is a scale factor used to adjust the area under the pulse.
The second segment of the glottal pulse consists of an exponential function which models the return phase, from the main excitation to the total closure [16]. This segment starts at the instant te and ends at the instant tc, so its duration is Tb = tc − te. Mathematically, this segment can be described by
$$ \begin{aligned} E(t)~=~\frac{-E_{e}}{\epsilon T_{a}} \left(e^{-\epsilon (t~-~t_{e})} ~-~e^{-\epsilon T_{b}} \right), & ~t_{e}~\leq~ t~\leq ~t_{c}, \end{aligned} $$
(38)
in which ε is the decay constant of the exponential in the return phase.
The main parameter for the second segment is Ta, which represents the effective duration of the return phase. For the Liljencrants-Fant model, the glottal flux, U(t), is given by
$$ U(t)=\begin{cases} \dfrac{E_{o}e^{\alpha t} \sin\left(\omega_{g}t - \arctan \frac{\omega_{g}}{\alpha} \right)}{\sqrt{\alpha^{2}+\omega_{g}^{2}}}+\dfrac{E_{o}\omega_{g}}{\alpha^{2}+\omega_{g}^{2}}, & t_{o} \leq t \leq t_{e},\\[2ex] \dfrac{E_{e}}{\epsilon^{2} T_{a}} \left[e^{-\epsilon(t-t_{e})} + \epsilon e^{-\epsilon T_{b}} \left(t - \left(t_{c} + \frac{1}{\epsilon} \right)\right) \right], & t_{e} \leq t \leq t_{c}. \end{cases} $$
(39)
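The two segments (37) and (38) can be sampled directly once the parameters are fixed. The sketch below is a minimal illustration under several assumptions that are not stated in the text: to = 0, the scale factor Eo is chosen so that the opening segment reaches −Ee at te, and ε is obtained from the condition εTa = 1 − exp(−εTb), which follows from requiring (38) to equal −Ee at t = te. The function name and all numeric values are hypothetical, and the zero net area condition of the LF model is not enforced here.

```python
import numpy as np

def lf_derivative(t, Ee, alpha, omega_g, te, tc, Ta, n_iter=50):
    """Glottal flux derivative of Eqs. (37)-(38) with t_o = 0; zero outside [0, tc]."""
    t = np.asarray(t, dtype=float)
    Tb = tc - te
    # Solve eps * Ta = 1 - exp(-eps * Tb) by fixed-point iteration (needs Tb > Ta).
    eps = 1.0 / Ta
    for _ in range(n_iter):
        eps = (1.0 - np.exp(-eps * Tb)) / Ta
    # Scale factor chosen so that the opening segment equals -Ee at te.
    Eo = -Ee / (np.exp(alpha * te) * np.sin(omega_g * te))
    opening = Eo * np.exp(alpha * t) * np.sin(omega_g * t)
    closing = -Ee / (eps * Ta) * (np.exp(-eps * (t - te)) - np.exp(-eps * Tb))
    return np.where(t < 0.0, 0.0,
                    np.where(t <= te, opening,
                             np.where(t <= tc, closing, 0.0)))

# Example values for a 10 ms cycle (assumptions chosen only for illustration).
te, tc, Ta = 0.006, 0.010, 0.0003
omega_g = np.pi / 0.0045                 # sinusoid crosses zero at 4.5 ms
t = np.linspace(0.0, tc, 1001)
e = lf_derivative(t, Ee=1.0, alpha=200.0, omega_g=omega_g, te=te, tc=tc, Ta=Ta)
```

The resulting waveform rises, crosses zero, reaches its negative peak −Ee at te, and then returns exponentially to zero at tc.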
In this article, the frequency domain representation of the glottal pulse derivative is given by Expression (40),
$$ \begin{aligned} E(\omega)~=~& \frac{\sqrt{\alpha_{e}^{2} ~+~\beta_{e}^{2} \omega^{2}} e^{-j\left(\omega t_{e} ~-~\text{arctan}\left(\frac{\beta_{e}}{\alpha_{e}}\omega \right) \right)}}{(\omega~+~j\alpha)^{2}-\omega_{g}^{2}} +\frac{E_{e}}{\epsilon T_{a}} (t_{c}-t_{e}) e^{\frac{-j\omega}{2} (t_{c}+t_{e})}\\ &-\frac{\sqrt{\alpha_{o}^{2} ~+~\beta_{o}^{2} \omega^{2}} e^{-j\left(\omega t_{o} ~-~\text{arctan}\left(\frac{\beta_{o}}{\alpha_{o}}\omega \right) \right)}}{(\omega~+~j\alpha)^{2}-\omega_{g}^{2}}\\ &.\left[ e^{-\epsilon T_{b}}\text{Sa}\left(\frac{(t_{c}-t_{e})}{2}\omega \right) - \frac{1}{2} e^{-\frac{\epsilon}{2}(t_{c}-t_{e})} \text{Sa}\left(j\frac{(t_{c}-t_{e})}{2}(\epsilon +j\omega) \right) \right], \end{aligned} $$
(40)
in which Sa(x) is the sampling function, that is, sin(x)/x.
4.2 Vocal tract
The vocal tract is the space between the vocal folds and the lips. In the process of voice generation, the glottal flux is the input to the vocal tract, whose muscles move the articulators, which, in turn, change the shape of the vocal tract and produce different sounds.
Compared to the movement of the vocal folds, the vocal tract changes shape relatively slowly. The minimum time interval necessary for nerves and muscles to vary the articulations which participate in speech formation corresponds to the duration of a phone. This duration is of the order of 50 ms, which represents the emission of 20 phones per second [17, 18].
During voice production, the vocal tract is excited by a train of pulses produced by the vocal folds in the case of voiced sounds and, in the case of unvoiced sounds, by turbulent air passing through the constrictions of the vocal tract. The vocal tract acts as a resonant filter, in which the different configurations of the articulators define distinct formant frequencies that shape the frequency spectrum of the sound propagating through its cavities. In general, three to five formants are necessary for the generation of phones.
In this article, the characteristic frequencies of the vocal tract are estimated by the linear predictive coding (LPC) method, which represents future samples as a linear combination of preceding samples and also allows the determination of the fundamental frequency, spectrum, and formants, among other parameters [19–21].
In LPC, the transfer function of the vocal tract is given by
$$ H(z) ~=~ \frac{1}{1-\sum_{h=1}^{p} a_{h} z^{-h}}, $$
(41)
which consists of an all-pole filter whose poles lie inside the unit circle. Evaluated on the unit circle, \(z=e^{j\omega}\), it becomes
$$ H(\omega) ~=~ \frac{1}{1-\sum_{h=1}^{p} a_{h} e^{-jh\omega}}. $$
(42)
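Expression (42) can be evaluated on a frequency grid for any set of predictor coefficients. The sketch below assumes that ω is the normalized digital frequency in radians per sample. The two-coefficient example, which places a single resonance near 700 Hz at a sampling rate of 8 kHz, is purely illustrative and is not taken from the article.

```python
import numpy as np

def lpc_frequency_response(a, omega):
    """Evaluate Eq. (42): H(w) = 1 / (1 - sum_h a_h * exp(-j*h*w))."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    h = np.arange(1, a.size + 1)
    # Outer product gives exp(-j*h*w) for every coefficient/frequency pair.
    denom = 1.0 - (a[:, None] * np.exp(-1j * np.outer(h, omega))).sum(axis=0)
    return 1.0 / denom

# One resonance: poles at r*exp(+/- j*theta) give a1 = 2*r*cos(theta), a2 = -r**2.
r, theta = 0.97, 2.0 * np.pi * 700.0 / 8000.0
a = [2.0 * r * np.cos(theta), -r**2]
omega = np.linspace(0.0, np.pi, 4096)
peak = omega[np.argmax(np.abs(lpc_frequency_response(a, omega)))]
```

The magnitude response peaks close to the pole angle θ, which is how the poles of the LPC model are associated with formant frequencies.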
4.3 Power spectral density of voice signal
The prototype for voice generation is based on the source-filter model, from which it differs mainly in the modelling of the cyclostationary vibration of the vocal cords. The model presents the voice as a signal produced by a linear time-invariant system over an interval in which the voice can be considered cyclostationary, typically from 16 to 32 ms, which makes it possible to estimate the behavior of the voice signal in the frequency domain.
The proposed source-filter model treats voice generation as a sequence of independent stages: excitation model, vocal tract, and radiation.
In the new voice production model, the source of excitation takes into consideration the cyclostationary movement of the vocal cords, based on their physical parameters, such as tension, mass, and length. Since, for a given speaker, the mass of the vocal cords is fixed and the length varies moderately, the vocal cord vibration is strongly related to the longitudinal tension applied to them.
In this context, the frequency of oscillation of the vocal cords is considered directly proportional to the tension to which they are subjected, controlled by a signal whose characteristics are present in the voice signal waveform.
The control signal is given in the time domain and its period is inversely proportional to the longitudinal tension applied to the vocal cords. The new voice production model then considers the voice signal to be the result of the convolution of the output of the cyclostationary impulse generator controlled by the tension, the glottal pulse, the impulse response of the vocal tract, and the radiation at the lips, as illustrated in Fig. 1 and expressed mathematically by
$$ V(t)~=~ GC(t) ~* ~E(t) ~*~H(t)~*~L(t), $$
(43)
in which V(t) represents the output signal, C(t) the impulse train controlled by the tension signal, E(t) the glottal pulse, H(t) the impulse response of the vocal tract, G a positive gain related to the power of the air that comes from the diaphragm and L(t) the effect of the radiation.
The effect of the radiation at the lips and nostrils is jointly represented by a high pass filter approximated by a first order derivative in the time domain, meaning that the derivative of the glottal flux is the excitation for the vocal tract. The radiation step amplifies the high frequencies with an average gain of 6 dB per octave and mathematically is given by
$$ L(\omega)~=~ 1 ~-~ \alpha e^{-j\omega}, $$
(44)
in which α is the lips/nostrils radiation coefficient, which usually assumes values between 0.95 and 0.99 so that the zero stays inside the unit circle in the z-plane.
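The high-pass behavior of (44) can be checked numerically. The sketch below assumes ω in radians per sample and uses α = 0.97, an arbitrary value inside the range quoted above. The function name and the two test frequencies are illustrative.

```python
import numpy as np

def radiation_response(omega, alpha=0.97):
    """Eq. (44): L(w) = 1 - alpha * exp(-j*w), with w in rad/sample."""
    omega = np.asarray(omega, dtype=float)
    return 1.0 - alpha * np.exp(-1j * omega)

# Gain increase over one octave at low frequency (0.2 -> 0.4 rad/sample).
low, high = np.abs(radiation_response(np.array([0.2, 0.4])))
octave_gain_db = 20.0 * np.log10(high / low)   # roughly 6 dB for alpha near 1
```

The magnitude grows from 1 − α at ω = 0 to 1 + α at ω = π, and over the low-frequency octave used here the increase is close to the 6 dB per octave mentioned in the text.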
The proposed model assumes that each of the subsystems for voice generation is a linear time-invariant filter. In this scenario, at each step of the generation process, the resulting power spectral density is given by multiplying the input power spectral density by the squared magnitude of the filter frequency response [22, 23].
Thus, the power spectral density of the voice produced by the new model is given by Expression (45), in which Sc(ω) represents the PSD of the impulse train which excites the vocal folds, E(ω) is the frequency response of the glottal pulse model, G is the gain associated with the air power, and H(ω) and L(ω) are the frequency responses of the vocal tract and of the radiation effect, respectively.
$$ \begin{aligned} S_{V}(\omega)~=~& G^{2} S_{c}(\omega) |E(\omega)|^{2} |H(\omega)|^{2} |L(\omega)|^{2}. \end{aligned} $$
(45)
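Once each factor has been sampled on a common frequency grid, (45) reduces to an element-wise product. The sketch below shows only this composition. The function name is hypothetical and the input arrays are placeholder values rather than outputs of the model.

```python
import numpy as np

def voice_psd(S_c, E, H, L, gain=1.0):
    """Eq. (45): S_V = G^2 * S_c * |E|^2 * |H|^2 * |L|^2 on one frequency grid."""
    S_c, E, H, L = (np.asarray(x) for x in (S_c, E, H, L))
    return gain**2 * S_c * np.abs(E)**2 * np.abs(H)**2 * np.abs(L)**2

# Placeholder samples at two frequencies (arbitrary numbers for illustration).
S_v = voice_psd(S_c=[1.0, 2.0], E=[1j, 2.0], H=[1.0, 1.0], L=[1.0, 0.5], gain=2.0)
```

Because only squared magnitudes enter the product, the phase of each frequency response has no influence on the resulting PSD.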
The purpose of the developed voice production model is to make it possible, based on its mathematical expressions, to adjust the parameters for the detection and classification of vocal cord pathologies.
Since the vibratory behavior of the vocal cords is altered by pathologies, the parameters of the obtained expressions, such as the probability density function, the sensitivity constant, and the representation of the train of cyclostationary impulses in time, can be adjusted to fit a healthy voice as well as the characteristics of each pathology.