Analysis of Acoustic Features in Speakers with Cognitive Disorders and Speech Impairments

This work presents the results of an analysis of the acoustic features (formants and the three suprasegmental features: tone, intensity and duration) of the vowel production of a group of 14 young speakers suffering from different kinds of speech impairments due to physical and cognitive disorders. A corpus of unimpaired children's speech is used to determine the reference values for these features in speakers without any kind of speech impairment within the same domain as the impaired speakers, namely, 57 isolated words. The signal processing to extract the formant and pitch values is based on a Linear Prediction Coefficients (LPC) analysis of the segments considered as vowels in a Hidden Markov Model (HMM) based Viterbi forced alignment. Intensity and duration are also based on the outcome of the automated segmentation. As the main conclusion of the work, it is shown that the intelligibility of the vowel production is lowered in impaired speakers even when the vowel is perceived as correct by human labelers. The decrease in intelligibility is due to a 30% increase of confusability in the formant map, a 50% reduction of the discriminative power in energy between stressed and unstressed vowels and a 50% increase of the standard deviation of the vowel length. On the other hand, impaired speakers keep good control of tone in the production of stressed and unstressed vowels.


Introduction
The presence of certain speech and language disorders produces a decrease in the intelligibility of the speech of the patients affected by them [1]. In languages like Spanish, vowels are the nuclei of every syllable and play an important role in the intelligibility of speech, so a decrease in their quality and discriminative power has a major effect on the overall intelligibility of the speech. The goal of this work is to analyze and characterize this loss of intelligibility in a group of young speakers with cognitive disorders.
Several analytic studies have examined the vocalic production of patients with different speech impairments. Cases of aphasia, a language disorder due to brain damage, have been studied to understand its influence on the decrease of quality in the vocalic production of these patients [2,3]. Dysarthria has also been studied, with claims that patients with severe affections still control some of their suprasegmental vocalic features [4], although lacking fine control over them. The affection of vocalic production in speech disorders due to Down's syndrome has also been studied [5] in pre- and postsurgical situations. Finally, the authors made an initial approach to this kind of analysis [6,7] with the Spanish database of project HACRO, which contains different kinds of impaired speech [8].
In this work, it will be studied how vowel production quality varies in a group of young speakers with cognitive disorders and, sometimes severe, associated speech impairments like dysarthria, with respect to a set of unimpaired reference speakers. Four features will be studied: formant frequencies, fundamental frequency (tone), intensity (energy) and duration (length). Formants are the acoustic parameters required to distinguish different vowels, while tone and intensity may play the main role in the utterance of stressed versus unstressed vowels [9,10]. Finally, the duration of vowels affects the correct perception of syllable prominence and position within the whole word or utterance [11], although its impact is not clear in Spanish.
The organization of this paper is as follows: In Section 2, the acoustic features to be studied in this work will be presented from the point of view of acoustic and perceptual phonetics. Section 3 will introduce the young speech corpora used in this paper: the reference subcorpus and the impaired subcorpus. In Section 4 the methods for the extraction of all the studied features will be presented, as well as the reference values extracted from the unimpaired speech corpus. The results over the impaired speech corpus and the comparison with the reference values will be given in Section 5 and discussed in Section 6. Finally, the conclusions to this work will be extracted in Section 7.

Features of Spanish Vowels
This section will give a brief review of the main acoustic features of vocalic production, focusing on their influence on the articulation of the Spanish vowels. The Spanish language contains five vowels (/a/, /e/, /i/, /o/ and /u/), clearly defined by their position in the formant map, as will be shown in the study of the reference corpus in Section 4.1. There are two allophones of the /i/ and /u/ vowels acting like glides ([j] and [w], resp.) that, despite being close to the vowels, cannot be considered vocalic sounds: they appear in unstressed position and make the transition to a purely vocalic sound, which is the nucleus of the syllable [12]. Hence, these glides are never considered for analysis in this work. Next, we will provide a basic theory of the Spanish vowels, according to their acoustic production and their influence on the perception of speech.
2.1. Formants. Formant frequencies are the only acoustic feature needed to describe Spanish vowels, and these frequencies rely heavily on the articulatory properties of each vowel [13]. The two main articulatory properties are the horizontal position of the tongue (defining palatal or front versus velar or back vowels) and the vertical position of the tongue (defining high versus low vowels). With this classification, a low position of the tongue produces a higher first formant, while a more palatal position of the tongue produces a higher second formant. Higher order formants like the third or fourth do not have a significant impact on Spanish vowels and are not considered in this work; moreover, tone does not have an impact on the distinction of vowels either.

2.2. Suprasegmental Features.
There are three main acoustic features that affect suprasegmental production in Spanish: tone, intensity and duration. In isolated words, as is the case in this work, these features mostly affect the distinct perception of stressed and unstressed vowels, although they do so in very different ways. Stress is considered in many phonetic theories as a binary feature that can be characterized as +stress or -stress, as perceived by the listener. Several trends differ on which suprasegmental feature carries most of the stress information, although nowadays it is widely accepted that tone is the main carrier of stress [15], followed by intensity. In any case, no categorical assertion can be made on this subject, as the main prosody of the sentence and other microprosodic features can affect this perception in different utterances, as can the different characterization of tone in each language.
Finally, duration also has an influence in the perception of stress, but it is very affected by the fact that every syllable has a canonic length, so the duration of a stressed vowel is only comparable to the duration of the same unstressed vowel when they are the nucleus of the same syllabic structure. Otherwise, no categorical conclusion can be made from the comparison of the duration of stressed and unstressed vowels.

Corpora for Analysis
This section will present the most relevant features of the corpora used in this work for the analysis carried out in Sections 4 and 5. Further information concerning other features of the corpus can be found in [16]. The vocabulary used in the recording sessions is the 57 words from the Induced Phonological Register (RFI) [17], a very well-known speech therapy handbook in Spanish. These 57 words contain 129 syllables and 292 phonemes, with several repetitions of the vowels in different syllabic structures (90 different syllables). More precisely, the total number of vowels in the set of words is 129 (58 /a/, 18 /e/, 9 /i/, 38 /o/ and 6 /u/), each of them being the nucleus of one of the 129 syllables (in Section 2, it was argued that glides are considered nonvocalic sounds).
The speech acquisition was made using "Vocaliza" [18], a computer-aided speech therapy tool that allows the acquisition of speech elicited from children by prompting them with text, audio and images. Recordings were made in an empty classroom environment with a close-talk microphone (AKG C444L) connected to a laptop with a conventional sound card, acquiring the signals at a 16 kHz sampling frequency and storing them with a depth of 16 bits. The main corpus is divided into two subcorpora: unimpaired and impaired speech.

Unimpaired Speech Corpus.
The unimpaired speech subcorpus contains speech from 168 young speakers (73 males and 95 females) in the range of 10 to 18 years old attending classes at primary and secondary schools in Zaragoza, Spain. Every speaker uttered one session of the isolated words in the RFI. The total number of utterances in this subset of the corpus is 9576 isolated words (6 hours of signal). The recording process was fully supervised by at least one member of the research team to assure the good quality of the pronunciation and the intelligibility of the utterances. Furthermore, only children with a good literacy assessment by their teachers were chosen to take part in the recordings. This subcorpus was recorded with the idea of providing a reference for the standard features in the speech of young speakers, as it is well known that children's speech has special features [19].

Impaired Speech Corpus.
The impaired speech subset of the corpus contains speech from 14 young speakers, whose age and gender distribution is shown in Table 1. Every speaker uttered 4 sessions of the RFI isolated words; that is, 228 isolated words per speaker and a total of 3192 isolated words in the corpus (3 hours of signal). All 14 speakers suffer from cognitive disabilities and some are also physically handicapped [16]. These disabilities affect their speech, producing a decrease in the quality and intelligibility of their utterances and also severe mispronunciations of some phonemes, which are either substituted by another phoneme or completely deleted.
Every utterance in the impaired speech subcorpus was manually labeled by three different experts to determine the pronunciation mistakes perceived in the speakers' productions. With a pairwise interlabeler agreement of 89.65%, the mispronunciation rate (substitutions and deletions) is 17.61% over the full set of phones (vowels, glides and consonants). The vowel mispronunciation results per speaker are shown in Table 2, where it can be seen that there is great variability in how each speaker's speech is affected: some speakers make nearly no mistakes, while others reach 20% of mistakes. Although some speakers make no mistakes in the vowels, this does not indicate that their voice is completely healthy, because they present some degree of dysarthria that affects their voice quality.
The average mispronunciation rate of every vowel is shown in Table 3; the mean result for the 5 vowels altogether is 7.43% of mispronunciations, where /a/ and /o/ are around 4%-5% and /e/, /i/ and /u/ are more frequently mispronounced, with 9%-10% of mistakes. Once again, it should be noted that this manual labeling only refers to substituted and deleted phonemes; it resembles a perceptual labeling of how human experts perceive the phonemes (as the canonical one or as any other), but does not indicate which phoneme the speakers actually uttered in place of the canonical expansion.
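A pairwise inter-labeler agreement figure like the 89.65% above can be computed as the mean, over all pairs of labelers, of the fraction of phones given the same label. The following sketch illustrates the computation on hypothetical labels; the function name and toy data are illustrative, not part of the original labeling protocol.

```python
from itertools import combinations

def pairwise_agreement(labelings):
    """Mean pairwise agreement across labelers.
    `labelings` is a list of equal-length label sequences, one per labeler."""
    rates = []
    for a, b in combinations(labelings, 2):
        # Fraction of positions on which the two labelers agree
        rates.append(sum(x == y for x, y in zip(a, b)) / len(a))
    return sum(rates) / len(rates)

# Toy usage: three hypothetical labelers over six phones ('0' marks a deletion)
labels = [list("aeiou0"), list("aeiou0"), list("aeio00")]
agreement = pairwise_agreement(labels)  # mean over the 3 labeler pairs
```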

Acoustic Analysis and Reference Results
The acoustic analysis carried out aims to achieve a robust estimation of the four features selected for study in Section 2. This section gives a brief review of the algorithms used for the acoustic analysis and focuses on the reference results over the unimpaired subcorpus. State-of-the-art speech processing algorithms are implemented to estimate these values following the diagram in Figure 1, as also implemented in the speech therapy tool "PreLingua" for the improvement of phonatory control in young children [20]. The speech processing is applied framewise (with a frame length of 25 milliseconds and a frame shift of 10 milliseconds) after obtaining the automated segmentation of the input speech via a Viterbi-based forced alignment. The Hidden Markov Models (HMMs) used for the Viterbi alignment were trained with 3 different databases containing adult unimpaired speech: Albayzin [21], SpeechDat-Car [22] and Domolab [7]. 39-dimensional Mel Frequency Cepstral Coefficient (MFCC) vectors are used as features for the HMM alignment, composed of 12 static features plus energy, together with their delta and delta-delta features. An example of the outcome of the automated segmentation over one of the utterances in the unimpaired children's subcorpus can be seen in Figure 2(a). The automated segmentation is initially based on the canonic transcription of each utterance (isolated words) but, to avoid the pernicious effect of phoneme deletions in the impaired speakers' pronunciations, the deleted phonemes (as perceived in the human labeling) are not fed as input into the automated segmentation, as shown in the example in Figure 2(b). After segmentation, impaired speech will be studied in two different groups: correctly pronounced vowels and mispronounced vowels. This way, intelligibility is studied separately in the situations in which the labelers still perceive the vowel as correctly pronounced and in the situations in which a mispronunciation is perceived.
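The framewise processing described above can be sketched as follows. The 25 ms frame length and 10 ms shift are taken from the text; the function name and the remaining details are illustrative assumptions, not the original implementation.

```python
import numpy as np

def frame_signal(signal, fs=16000, frame_ms=25, shift_ms=10):
    """Split a speech signal into overlapping Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift: i * shift + frame_len] * window
                       for i in range(n_frames)])
    return frames

# Usage: one second of 16 kHz audio yields (16000 - 400) // 160 + 1 = 98 frames
frames = frame_signal(np.zeros(16000))
print(frames.shape)  # (98, 400)
```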

4.1. Feature Estimation.
The feature estimation is carried out in the following steps: after signal preprocessing (DC offset removal, pre-emphasis and Hamming windowing), a Linear Prediction Coefficient (LPC) analysis [23] is applied to every frame, modeling the vocal tract as the all-pole filter H(z) in (1):

    H(z) = 1 / (1 - Σ_{k=1}^{16} a_k z^{-k}),    (1)

where the input signal s(n) is estimated as ŝ(n), obtained from the past samples through the prediction model whose time-domain impulse response h(n) is associated with H(z), as in (2):

    ŝ(n) = Σ_{k=1}^{16} a_k s(n - k).    (2)

The estimation of the formants takes the 16 LPC coefficients (a_k) in the prediction model H(z) and extracts the polynomial roots, each of them associated with a formant frequency. The roots with the two highest absolute values correspond to the first and second formants.
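The root-based formant extraction can be sketched as follows, using the autocorrelation (Yule-Walker) method to obtain the LPC coefficients. This is a minimal illustration under that assumption; the function name and the synthetic check are not the paper's actual code.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs=16000, order=16):
    """Estimate F1/F2 from the roots of a 16th-order LPC polynomial,
    keeping the two conjugate root pairs closest to the unit circle."""
    # Autocorrelation (Yule-Walker) solution for the LPC coefficients a_k
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    # Roots of the prediction-error polynomial A(z) = 1 - sum_k a_k z^-k
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    strongest = roots[np.argsort(-np.abs(roots))[:2]]
    return np.sort(np.angle(strongest)) * fs / (2 * np.pi)  # [F1, F2] in Hz

# Usage: two damped resonances at roughly 700 Hz and 1200 Hz
n = np.arange(400)
rng = np.random.default_rng(1)
frame = (np.exp(-0.002 * n) * np.cos(2 * np.pi * 700 * n / 16000)
         + 0.8 * np.exp(-0.002 * n) * np.cos(2 * np.pi * 1200 * n / 16000)
         + 1e-3 * rng.standard_normal(400))
f1, f2 = lpc_formants(frame)
```

In practice, formant trackers often also apply a bandwidth criterion to discard weak roots; the magnitude-only rule above mirrors the criterion described in the text.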
Tone estimation calculates the prediction error e(n), given in (3), and its autocorrelation r(k), given in (4), where f_rl is the frame length in samples (25 milliseconds per frame):

    e(n) = s(n) - ŝ(n),    (3)

    r(k) = Σ_{n=0}^{f_rl - 1 - k} e(n) e(n + k).    (4)

The index k at which the autocorrelation has its maximum value outside the area around the origin r(0) is the pitch period k_pitch, associated with the pitch frequency F_pitch = F_sample / k_pitch, where F_sample is 16 kHz as mentioned before. An estimation of the sonority value, the ratio between the maximum of the autocorrelation and its value at the origin (r(k_pitch)/r(0)), indicates whether the frame is sonorant enough to be considered a vowel and, hence, whether the calculated pitch and formant values can be trusted. A high sonority threshold avoids pitch and formant prediction mistakes, although some correct frames might be rejected.
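The autocorrelation pitch estimator with the sonority check can be sketched as follows. The 75-400 Hz search band and the threshold value are illustrative assumptions, not values stated in the paper.

```python
import numpy as np

def autocorr_pitch(error, fs=16000, fmin=75, fmax=400, sonority_min=0.4):
    """Pitch from the autocorrelation of the LPC residual e(n).
    Returns (F_pitch, sonority); F_pitch is None for non-sonorant frames."""
    r = np.correlate(error, error, mode="full")[len(error) - 1:]
    # Search the lag range corresponding to [fmin, fmax], away from r(0)
    k_lo, k_hi = int(fs / fmax), int(fs / fmin)
    k_pitch = k_lo + np.argmax(r[k_lo:k_hi])
    sonority = r[k_pitch] / r[0]        # ratio to the value at the origin
    if sonority < sonority_min:
        return None, sonority           # reject frame as non-vocalic
    return fs / k_pitch, sonority

# Usage: a pulse train with period 100 samples -> 16000 / 100 = 160 Hz
pulses = np.zeros(800)
pulses[::100] = 1.0
f_pitch, sonority = autocorr_pitch(pulses)
print(f_pitch)  # 160.0
```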
For the intensity estimation, some considerations are needed. First, absolute values of intensity (that is, sample values or directly computed frame energy) cannot be used in the study, as it is not possible to reliably argue that the input intensity stayed steady through all the different sessions: the recordings of all the speakers took more than a year. However, it is reasonable to argue that the Signal-to-Noise Ratio (SNR) remains constant for similar speech intensity independently of the input volume, since a close-talk microphone was used for the recordings.
This assumption is evaluated by estimating the background noise power level in the corpus, whose mean value is 27.15 dB for the reference subcorpus and 27.07 dB (6.61 dB of standard deviation) for the impaired subcorpus, which validates the hypothesis that the noise level is directly related to the intensity level and keeps similar and good properties through all the recordings. Hence, prior to energy estimation, the average background noise power is calculated over all the frames considered as nonspeech in the forced alignment. Afterwards, for each frame of the vowels, the framewise energy is calculated and the SNR is obtained by subtracting the noise power in the utterance. By convention, from now on, intensity or energy will refer to this value of SNR from which the background noise level has been subtracted. The duration is calculated by estimating the length of the vowel in milliseconds: the number of frames assigned to each vowel in the forced alignment is multiplied by the frame shift value of 10 milliseconds per frame. A threshold over the energy is applied to restrict the vowel boundaries and hence avoid the effect of coarticulation in the transitions to or from consonantal sounds. This threshold was preset to reject boundary frames with low energy, whose pitch and formant calculations could be inaccurate.
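The SNR-based intensity and the frame-count duration described above can be sketched as follows; the function names and the toy frame matrices are illustrative assumptions.

```python
import numpy as np

def frame_energy_db(frames):
    """Per-frame log energy in dB (small floor avoids log of zero)."""
    return 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

def vowel_snr_and_length(vowel_frames, noise_frames, shift_ms=10):
    """SNR per vowel frame (frame energy minus average background noise
    power) and vowel duration from the frame count, as described above."""
    noise_db = np.mean(frame_energy_db(noise_frames))
    snr = frame_energy_db(vowel_frames) - noise_db
    length_ms = vowel_frames.shape[0] * shift_ms
    return snr, length_ms

# Usage: 12 vowel frames 10x the amplitude of the noise -> 20 dB SNR, 120 ms
snr, length_ms = vowel_snr_and_length(np.ones((12, 400)),
                                      0.1 * np.ones((5, 400)))
print(length_ms)  # 120
```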

4.2. Reference Results.
The reference subcorpus of 168 unimpaired young speakers was initially analyzed to determine the standard values of the formants and suprasegmental features under study in this work. Some general assumptions will be made concerning the statistical properties of the features studied in this work: first, the values of the formants follow a 2-dimensional Gaussian distribution for each vowel. Values of pitch and energy follow a Gaussian distribution separately for stressed and unstressed vowels (pitch can only be considered for one speaker alone or for a population of the same gender and age). Finally, the values of vowel length follow a Gaussian distribution for each vowel.
All the values in this section are given in terms of mean (μ), standard deviation (σ), skewness (γ1) and excess kurtosis (γ2), where the values of γ1 and γ2, close to zero, validate the Gaussian assumptions. Once the Gaussian properties are assured, in the studies on the impaired subcorpus in Section 5, μ and σ will be the only statistics given. All reference values are shown in Tables 4 (formants), 5 (pitch), 6 (energy-SNR) and 7 (length). Table 5 shows only the results for the group of unimpaired females of 13-14 years old as an example of the pitch trend in the unimpaired data (the rest of the groups behave similarly and are not shown here for reasons of space; recall that pitch has to be studied separately by gender and age to maintain the Gaussian distribution condition). A graphical representation of these features is given in Figure 3. Referring to the formant results in Table 4 and Figure 3(a), the values are similar to the canonic formant values traditionally accepted in Spanish phonetics, and a good discrimination can be made among all five vowels. Pitch and energy in Tables 5 and 6 and Figures 3(b) and 3(c) show their discriminative effect in the perception of stress, as the pitch in stressed vowels is 10-20 Hz over the pitch of unstressed vowels and the energy in stressed vowels is 4-7 dB over the energy of unstressed vowels. Finally, regarding length in Table 7 and Figure 3(d), it is seen that vowel production is steady in its length, with a standard deviation not exceeding 40%-50% of the mean length results (around 120 milliseconds).
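The Gaussianity check via skewness (γ1) and excess kurtosis (γ2) described above can be reproduced with standard statistical tools. This sketch uses scipy.stats on synthetic pitch-like values; the function name and sample parameters are illustrative.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def gaussianity_summary(x):
    """mu, sigma, skewness (gamma1) and excess kurtosis (gamma2);
    gamma1 and gamma2 close to zero support the Gaussian assumption."""
    return {"mu": float(np.mean(x)),
            "sigma": float(np.std(x, ddof=1)),
            "gamma1": float(skew(x)),
            "gamma2": float(kurtosis(x))}  # fisher=True: excess kurtosis

# Usage: synthetic pitch-like sample (mean 250 Hz, sigma 15 Hz)
rng = np.random.default_rng(0)
stats = gaussianity_summary(rng.normal(250, 15, 5000))
```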

Impaired Speech Results
In this section, the results of the acoustic analysis over the impaired speech subset of the corpus are given. This analysis comprises the four acoustic features considered in Section 2, while making an initial comparison with the results in Section 4.2 over the reference subcorpus. The full comparative analysis will be made in Section 6 with the help of statistical tools like the Kullback-Leibler Divergence and the Fisher Ratio.

Formant Results.
The formant map for the 14 impaired speakers is shown in Figure 4. Figure 4(a) provides the formant map for the vowels perceived as correctly pronounced by the human labelers, with their statistics given in the first columns of Table 8. Two major effects can be appreciated: first, the increase in the area covered by every vowel in the formant map in Figure 4(a), seen as an increase in the standard deviation of the formants in Table 8 when compared to the formants of the reference speakers in Table 4; and second, the displacement of vowels /a/, /e/ and /o/ towards the center of the formant map in Figure 4(a), also appreciated in the mean results in Table 8.

Concerning the results for the vowels perceived as mispronounced by the human labelers, given in Figure 4(b) and the second half of Table 8, a total confusion of the formants can be appreciated, as expected in this case where the speakers made a mistake in the pronounced vowel. All the formants are centered in the middle of the formant map and the standard deviation is much higher: what the speakers actually utter differs from the expected canonical vowel, and the production of speech is blurred in the formant map, as the labelers were not asked to indicate what the speaker was really saying.

Tone (Pitch) Results.
The study of the pitch values for the impaired subcorpus would best be given separately for every speaker; however, the lack of sufficient data for a correct statistical analysis (especially when studying mispronounced vowels) leads to the need to gather speakers into groups with similar pitch values. Hence, 4 groups are created:
(i) Group A gathers speakers Spk03, Spk06, Spk07 and Spk12 (4 of the older males, with very low pitch values).
(iii) Group C gathers speakers Spk04, Spk09, Spk13 and Spk14 (4 females with medium-high pitch values).
(iv) Group D gathers speakers Spk01 and Spk02 (a male and a female with high pitch).
The results for the 4 groups of speakers are given in Tables 9 (correctly pronounced vowels) and 10 (mispronounced vowels, where some values are missing due to the nonexistence of data for those cases).
It can be seen that impaired speakers keep good control of these prosodic features: the pitch values are steady among all five vowels, and the speakers show the ability to discriminate stressed from unstressed vowels in all 5 vowels in ways similar to the reference speakers (with 10-20 Hz of separation between stressed and unstressed vowels). The results in the case of mispronounced vowels have to be considered with caution, as the nonexistence of some cases leads to unreliable results.

Intensity (Energy) Results.
Regarding the values of framewise energy (SNR, as explained in Section 4), the average results for all the impaired speakers are given in Table 11. It is seen that energy keeps good properties for the impaired speakers, who are able to produce an increase in their intensity when uttering stressed vowels, although compared to the reference results in Section 4.2 there is a slight increase in the energy of unstressed vowels. On the other hand, a reduction in the energy of stressed vowels is noticed in the vowels labeled as mispronunciations.

Duration (Length) Results.
The statistics for the vowel length results in the group of 14 impaired speakers are shown in Table 12. It can be seen that there is an increase in the average length of around 15 milliseconds for all vowels when compared to the reference speakers in Table 7, but what is more noticeable is the increase in standard deviation (more than 50%), which indicates the presence of vowels with a very variable length, that is, the existence of extremely long and extremely short vowels, as there is no significant change in the skewness and Kurtosis of the statistics. The increase in standard deviation is especially noticeable in the mispronounced vowels, which indicates that what the speakers actually utter instead of the vowels is a nonsteady realization of speech. This might indicate that the speakers are unsure of their production of speech, so they either try to skip the vowel (making it shorter) or make it longer while they try to pronounce the right sound.

Discussion
The results obtained in Section 5 give way to a discussion on several aspects of the vocalic production of impaired speakers. The discussion in this section is accompanied by the computation of the Kullback-Leibler Divergence (KLD) and the Fisher Ratio (FR) [24]. These two measures are known to provide a good metric of the discriminative power between two different random variables. In this work, they help to quantify the discriminative separation between vowels in the formant map and between stressed and unstressed vowels in terms of tone and intensity. The KLD definition for n-dimensional Gaussian distributions will be considered (2-dimensional in the case of formants and 1-dimensional for the other features). For two distributions A ~ N(μ_A, Σ_A) and B ~ N(μ_B, Σ_B), where μ_A and μ_B are mean vectors, Σ_A and Σ_B diagonal covariance matrices and n the dimension of the distributions, it is given by (5):

    KL(A, B) = (1/2) [ tr(Σ_B^{-1} Σ_A) + (μ_B - μ_A)^T Σ_B^{-1} (μ_B - μ_A) - n + ln(det Σ_B / det Σ_A) ].    (5)

However, this definition of the KLD is nonsymmetric, that is, KL(A, B) ≠ KL(B, A), so a symmetrized KLD (sKLD) is defined in (6):

    sKLD(A, B) = KL(A, B) + KL(B, A).    (6)

Finally, the FR for the two n-dimensional Gaussian distributions A ~ N(μ_A, Σ_A) and B ~ N(μ_B, Σ_B) is given in (7):

    FR(A, B) = (μ_A - μ_B)^T (Σ_A + Σ_B)^{-1} (μ_A - μ_B).    (7)

Concerning the formants (the only acoustic feature distinguishing the vowels), there is an important decrease in sKLD and FR between the vowels /a/, /e/ and /o/ in the formant map in Table 13, while vowels /i/ and /u/ separate from the other 3 vowels, increasing their sKLD and FR in the formant map.
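Under the Gaussian assumptions of Section 4.2, the sKLD and FR used in this discussion can be computed as in the following sketch (valid for diagonal or full covariances; the function names are illustrative).

```python
import numpy as np

def kld_gauss(mu_a, cov_a, mu_b, cov_b):
    """KL(A||B) for n-dimensional Gaussian distributions."""
    n = len(mu_a)
    inv_b = np.linalg.inv(cov_b)
    diff = mu_b - mu_a
    return 0.5 * (np.trace(inv_b @ cov_a) + diff @ inv_b @ diff - n
                  + np.log(np.linalg.det(cov_b) / np.linalg.det(cov_a)))

def skld(mu_a, cov_a, mu_b, cov_b):
    """Symmetrized KLD: sKLD(A, B) = KL(A||B) + KL(B||A)."""
    return (kld_gauss(mu_a, cov_a, mu_b, cov_b)
            + kld_gauss(mu_b, cov_b, mu_a, cov_a))

def fisher_ratio(mu_a, cov_a, mu_b, cov_b):
    """Fisher Ratio between two Gaussian distributions."""
    diff = mu_a - mu_b
    return diff @ np.linalg.inv(cov_a + cov_b) @ diff

# Usage: 1-D stressed vs. unstressed pitch, both with sigma = 15 Hz
mu_s, mu_u = np.array([230.0]), np.array([215.0])
cov = np.array([[225.0]])
d_skld = skld(mu_s, cov, mu_u, cov)       # 1.0 for a 15 Hz / 15 Hz-sigma gap
d_fr = fisher_ratio(mu_s, cov, mu_u, cov) # 15^2 / (225 + 225) = 0.5
```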
However, this is not a precise vision of the situation, because these two vowels are the least frequent in Spanish; not only in the vocabulary of this work in Section 3, but also in other major text corpora in Spanish like the Europarl corpus [25], where the percentage of appearances of the vowels is 11.83% for /e/, 9.51% for /a/, 8.07% for /o/ and only 4.28% for /i/ and 1.74% for /u/. Thus, when computing a weighted average result in sKLD and FR (last row of Table 13), where the weights are the percentages of appearance of every vowel in the vocabulary, there is an average reduction of 20.28% in the sKLD and 30.95% in the FR with respect to the reference speakers.

In terms of suprasegmental features, the separation between stressed and unstressed vowels is given in Tables 14 (pitch) and 15 (energy). Table 14 shows that there is no significant decrease in the weighted sKLD and FR in pitch between unimpaired speakers and impaired speakers (when uttering the vowels correctly). This corroborates previous works [4] in the fact that impaired speakers can still control some prosodic features in their speech even when they lose intelligibility in their vowel production. The results for the vowels mispronounced by impaired speakers cannot be considered due to the pernicious effect of unseen cases in the test data.
It is in terms of energy (or intensity) that impaired speakers seem to have the biggest problems in the control of prosody and stress. There is a reduction of 56.26% in sKLD and 56.82% in FR in the discriminative power between these two distributions, and this reduction increases to 80% in the case of mispronounced vowels. As mentioned in Section 5, this reduction in discriminative power is mostly due to an increase in the energy of unstressed vowels. The reason might be that impaired speakers try to reassure themselves in their pronunciation by raising their intensity in situations of hesitation. This extra intensity would not affect stressed vowels, because stressed vowels already have higher energy due to the prosodic feature of stress.
Finally, the study of the length of the vowels produced by the impaired speakers in Table 12 shows an effect of dispersion in vowel length. This means that the vowels uttered by these speakers are more often abnormally long or short. Actually, two separate effects can be appreciated: in the case of correctly pronounced vowels there is a lengthening of the vowels (around a 20%-30% increase in mean values between Tables 7 and 12), while mispronounced vowels are excessively dispersed (with standard deviations of 80% of the mean values), mainly due to the doubts and hesitations of incorrect pronunciations. The increased duration of correctly pronounced vowels might indicate certain hesitations of the speakers when uttering their speech, due to the insecurity in speech production caused by their speech disorders.

Conclusion
As conclusions to this work, a whole corpus of unimpaired and impaired children's speech has undergone an acoustic study based on LPC analysis to calculate acoustic features like formants and suprasegmental features (pitch, energy and length). The results show that the good properties of unimpaired speakers (well-behaved formants, separation of stressed and unstressed vowels in terms of pitch and energy, and statistically consistent length features) are distorted in different ways in the impaired speakers.
Impaired speakers reduce by 20%-30% the discriminative ability of the formant map, even when the pronunciation is perceived as correct by a set of human experts. The results in the case of mispronunciations show a total blur in the formant map, as expected and as detected by the human experts. Impaired speakers keep good control of tone as a feature for the microprosody of the words, but intensity discrimination between stressed and unstressed vowels is reduced by 50% due to an increase in the energy of unstressed vowels. Finally, it has been shown that these speakers have problems maintaining a steady production of vowels in terms of length, with the abnormal production of extremely long or short vowels reflected in a 50% increase in the standard deviation of the vowel length.
Hence, it can be concluded that the main problems in vowel production due to the speech disorders analyzed in this work appear in the formants, intensity control and vowel length, while the speakers are able to maintain a correct production of pitch. Further work in this area may include a more precise analysis of the formant values, considering their relationship to the pitch value of every speaker. Also, the results in this work could be validated against a manual segmentation of the vowels, although the automated segmentation is robust enough and, together with the strict sonority threshold applied, assures that all the frames analyzed belong to vowels.
Further studies in vowel duration may also be carried out with a new vocabulary containing the same syllables in different positions and stress situations. A bigger study considering connected speech might also be done to study the loss of prosodic features in complete sentences; this study might be useful to determine whether impaired speakers have problems with prosody control in a more complex context than the simple control of stress features. Another study of interest would be to link these results to a full phonetic transcription of the speakers' speech (with a confusion matrix of the mispronunciations) and also to analyze each speaker's speech separately in terms of acoustic parameters, although that would require a more careful statistical analysis due to the reduction in the amount of data studied.