Phonologically-based biomarkers for major depressive disorder

Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.


Introduction
Major depressive disorder (MDD) is the most prevalent of the mood disorders; the lifetime risk has been observed to fall between 10 and 20% for women and between 5 and 12% for men [1]. In addition, the 2001 World Health Report names MDD as the most common mental disorder leading to suicide [2,3]. Currently, no laboratory markers have been determined for the diagnosis of MDD, although a number of abnormalities have been observed when comparing patients with depression to a control group [2]. Accurate diagnosis of MDD requires intensive training and experience; thus, the growing global burden of depression suggests that an automatic means to help detect and/or monitor depression would be highly beneficial to both patients and healthcare providers. One such approach relies on the extraction of biomarkers to provide reliable indicators of depression.
One class of biomarkers of growing interest is the large group of vocal features that have been observed to change with a patient's mental condition and emotional state. Examples include vocal characteristics of prosody (e.g., pitch and speech rate), spectral features, and glottal (vocal fold) excitation patterns [4-11]. These vocal features have been shown to have statistical relationships with the presence and the severity of certain mental conditions, and, in some cases, have been applied toward developing automatic classifiers. In this article, we expand on a previous study of the particular prosodic biomarker of speech rate, which has been shown to significantly separate control and depressed patient groups [12]. Specifically, we present vocal biomarkers for depression severity derived from phonologically-based measures of speech rate. In addition, we investigate this dependence with respect to each of the symptom-specific components that comprise the standard 17-item HAMD [13] composite assessment of depression. For example, supporting the premise that psychomotor retardation can be observed in the speech rate [12,14], we reveal high correlations between the HAMD Psychomotor Retardation sub-topic and not only the global speech rate but also a subset of individual phone durations. Although the specific focus in this article is on biomarkers derived from speech rate, we provide a general framework in which to explore the relationship between phonologically-based biomarkers and the severity of individual MDD symptoms.
In this study, we investigate the correlations between phonologically-based biomarkers and the clinical HAMD severity ratings, for a 35-speaker free-response speech database, recorded by Mundt et al. [7]. We first compute global speech rate measures and show the relationship with the HAMD total and sub-topic ratings through correlation studies; these global rate measures are computed by finding the average phone rate using an automatic phone-recognition algorithm. We then examine the correlations of the HAMD ratings with the average duration of pauses and automatic recognition-based individual English phone durations, providing a fine-grained analysis of speech timing. With regard to the pause measures, the findings with pause duration are consistent with previous total HAMD rating correlations [7], but extend the analysis to the sub-topics. With regard to the individual phone durations (vowels and consonants), higher individual correlation values than those found with the global speech rate measures reveal distinct phone-specific relationships. The individual phone durations that show significant correlations within a single HAMD category (total or sub-topic) are observed to cluster approximately within manner-of-articulation categories and according to the strength of intercorrelation between sub-topics. These significantly correlated phone lengths within a sub-topic are then selected and linearly combined to form composite durations; these composite durations result in correlation values that exceed those found not only using the individual phone durations but also the more global vocal measures that are used in our study and previous studies [7]. 
As an extension of the individual phone duration results, the energy spread of a phone is provided as an alternate duration measure; the energy spread measure reveals some similar phone-specific correlation patterns and more changes in correlations with burst consonants relative to those calculated from the recognition-based duration. A broad overview of our phonologically-based (fine-grained timing) framework with an included list of our key measures is illustrated in Figure 1.
We conclude with a preliminary classification investigation using our phonologically-based duration measures, guided by the significant correlations from our phone-specific results. Using a simple Gaussian-likelihood classifier, we examine the accuracy in classifying the individual symptom sub-topic ratings by designing a multi-class classifier where each rating level is set as its own class. The classification root mean squared error (RMSE) is reported as a measure of accuracy. Our preliminary classification results show promise as a beneficial tool to the clinician, and motivate the addition of other phone-based features in classification of depression severity.
Our results provide the framework for a phone-specific approach in the study of vocal biomarkers for depression, as well as for analyzing individual symptom categories. The scarcity and variability of samples in our database point to a need for further experiments with larger populations, both to exploit this framework further and to account for the variety within one group of MDD patients.
Figure 1 Overview of the general framework presented in this article and our specific approach.

Major depressive disorder (MDD)
MDD places a staggering global burden on society. Of all the mental disorders, MDD accounts for a loss of 4.4% of the total disability-adjusted life years (DALYs)^a, and accounts for 11.9% of total years lost due to disability (YLD). With current trends, the projection for the year 2020 is that depression will be second only to ischemic heart disease as the cause of DALYs lost worldwide [3].

Diagnosis and treatment
MDD is characterized by one or more major depressive episodes (MDEs), where an MDE is defined as a period of at least two weeks during which either a depressed mood dominates or markedly diminished interest, also known as anhedonia, is observed. Along with this, the American Psychiatric Association standard recommends that four or more of the following symptoms also be present for diagnosis: significant change in weight or appetite, insomnia or hypersomnia nearly every day, psychomotor agitation or retardation (clearly observable by others), fatigue, feelings of worthlessness or excessive guilt, diminished ability to concentrate or decide, and/or recurrent thoughts of death or suicide [2]. These standards are reflected in the HAMD depression rating method, which encompasses multiple symptoms to gauge the overall severity of depressive state, as discussed further in the next section. Conventional methods for treatment of MDD include pharmacotherapy and/or psychotherapy; an exhaustive coverage of depression treatment is beyond the scope of this article.

Depression evaluation-HAMD
We consider the standard method of evaluating levels of MDD in patients, the clinical 17-question HAMD assessment (a detailed description of the database is given in Section 3). To determine the overall or total score, individual ratings are first determined for symptom sub-topics (such as mood, guilt, psychomotor retardation, suicidal tendency, etc.); the total score is then the aggregate of the ratings for all sub-topics. The sub-topic component list for the HAMD (17 symptom sub-topics) evaluation is provided in the Appendix. Scores for component subtopics have ranges of (0-2), (0-3), or (0-4).
Although the HAMD assessment is a standard evaluation method, there are well-known concerns about its validity and reliability [15]. Nevertheless, the purpose of this article is not to test whether the HAMD ratings (or its sub-topic ratings) are valid, but instead provide a flexible analysis framework that can be adapted to future depression evaluation standards. The interdependencies for our particular database are discussed in Section 3.

Previous studies
In this section, we provide a representative sampling of vocal features previously applied as MDD discriminators through correlation measurements and/or classification algorithms. These vocal measurements fall into the broad categories of prosody (e.g., pitch and speech rate), spectral, glottal (vocal fold) excitation, and energy (power).
We begin with an early study by Flint et al. [16] who used the second formant transition, voice onset time, and spirantization, a measure that reflects aspirated "leakage" at the vocal folds, to discriminate between MDD, Parkinson's disease, and control subjects. Although significant ANOVA (analysis of variance) differences were computed for a small feature subset, no significant correlations between any of the features and the HAMD scores were found in the depression studies.
France et al. [4] later used similar biomarkers including the fundamental frequency, amplitude modulation, formant statistics, and power distribution to classify control, dysthymic, MDD, and suicidal males and females, separately. The female vocal recordings showed spectral flattening with MDD; the results for the male recordings showed that the location and bandwidth of the first formant along with the percent of total power in the 501-1000-Hz sub-band were the best discriminators between the MDD subjects and the controls.
Ozdas et al. [8,9] investigated the use of two vocal features, vocal-cord jitter and the glottal flow spectrum, for differentiating between control, MDD, and near-term suicidal risk subjects. Depressed and near-term suicidal patients showed increased vocal-cord jitter and glottal spectral slope.
Moore et al., in a series of articles [6,10], also investigated vocal glottal excitation, spectral, and prosodic characteristics. A large variety of statistical measures were then utilized to construct classifiers for distinguishing control from depressed patient groups; these classifiers were employed to infer the most differentiating feature-statistic combinations for their dataset.
Low et al. [5] combined prosodic, spectral, and the first and the second derivatives of the mel-cepstra features to classify control and clinically depressed adolescents, using a Gaussian mixture model-based classifier. With a combination of these vocal features, the final classification accuracy was able to reach 77.8 and 74.7% for males and females, respectively.
A study by Mundt et al. [7] showed that depressed patients responding to treatment significantly increased their pitch variability about the fundamental frequency more than non-responders did. This analysis also suggested that depressed patients may extend their total vocalization time by slowing their syllable rate and through more frequent and longer pause times. The results of Mundt et al. provide a springboard for our current effort. In contrast to the Mundt et al. study, which uses the assumed fixed number of syllables in the "Grandfather Passage" to analyze speech rate, this study focuses on the conversational free-response speech recordings and performs a fine-grained analysis using automatically detected individual phone durations. More detailed comparisons with the results of Mundt et al. are provided in the measurement sections of this paper, where comparative measures are analyzed.
As one of the emerging approaches to depression recognition, Cohn et al. [11] aimed at fusing facial and vocal features to create a more accurate MDD classifier. Measures of vocal prosody included average fundamental frequency and participant/speaker switch duration. Using a support vector machine (SVM) classifier, true positive and negative rates of 88 and 64%, respectively, were achieved from these vocal features.
Certain vocal features in MDD studies are also tracked in studies of vocal affect and emotion. Among these features are the changes in mean fundamental frequency, mean intensity, and rate of articulation, as well as standard spectral-based speech analysis features such as the mel-cepstrum [17,18].
The vocal biomarker studies described in this section generally take a global approach to speech, as opposed to phone- or phonological group-specific effects. In addition, these studies focus primarily on the total evaluation ratings or group depressed patients into one large set, regardless of sub-symptom variability. In contrast, the approach of this article relies on decomposition of the speech signal into unique phones and of the total depression score into individual symptom sub-topic ratings, thus providing a unique framework for detailed analysis of unit-dependent vocal features, and how they change with individual aspects of depression severity.

Database
The data used in this analysis was originally collected by Mundt et al. [7] for a depression-severity study, involving both in-clinic and telephone-response speech recordings. Thirty-five physician-referred subjects (20 women and 15 men, mean age 41.8 years) participated in this study. The subjects were predominantly Caucasian (88.6%), with four subjects being of other descent. The subjects had all recently started on pharmacotherapy and/or psychotherapy for depression and continued treatment over a 6-week assessment period. Speech recordings (sampled at 8 kHz) were collected at weeks 0, 2, 4, and 6 during an interview and assessment process that involved HAMD scoring. To avoid telephone-channel effects, only the samples of conversational (free-response) speech recorded in the clinic are used in our follow-up study. In addition, we only used data from subjects who completed the entire longitudinal study. This resulted in approximately 3-6 min of speech per session (i.e., per day). More details of the collection process are given in [7].
Ratings from the 17-item HAMD clinical MDD evaluation were chosen as comparison points in our study. Individual sub-topic ratings from each evaluation (see Appendix) were also used both in our correlation studies and classification-algorithm development.
An important additional consideration is that of the intercorrelations between the HAMD symptom subtopics. Figure 2 shows all the significant intercorrelations between the HAMD sub-topics, computed with our dataset. The greatest absolute correlation of 0.64 corresponds to the Mood and Work-Activities subtopics. High significant correlations group the sub-topics of Mood, Guilt, Suicide, and Work-Activities together. Relevant to the findings in this study, the Psychomotor Retardation sub-topic has the strongest correlations with Agitation (-0.40) and Mood (0.36, not labeled).

Global rate measurements
Our approach is based on the hypothesis that general psychomotor slowing manifests itself in the speech rate, motivated by observed psychomotor symptoms of depression [12,16] and supported by previous findings of correlation between MDD diagnosis and/or severity with measures of speech rate [7]. In our study, we investigate a measure of speech rate derived from the durations of individual phones. For the phone-based rate measurements, we use a phone recognition algorithm based on a Hidden Markov Model approach, which was reported as having about an 80% phone-recognition accuracy [19]. Possible implications of phone-recognition errors are discussed in Section 5.
We compute the number of speech units per second over the entire duration of a single patient's free-response session. We use the term speaking rate to refer to the phone rate over the total session time, with times when the speech is not active (pauses) included in the total session time. This is in contrast to articulation rate, which is computed as the phone rate over only the time during which speech is active.
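As an illustration of the two definitions above, the following minimal Python sketch computes both rates from a list of recognizer segments. The (label, start, end) tuple layout and the function name phone_rates are assumptions made for this example and are not taken from the recognizer of [19]. The sketch also assumes that the segments tile the whole session, so that the total session time equals the sum of the segment durations.

```python
from typing import List, Tuple

# Hypothetical segment layout: (label, start_sec, end_sec); "sil" marks a pause.
Segment = Tuple[str, float, float]


def phone_rates(segments: List[Segment], pause_label: str = "sil") -> Tuple[float, float]:
    """Return (speaking_rate, articulation_rate) in phones per second."""
    n_phones = sum(1 for label, _, _ in segments if label != pause_label)
    # Speaking rate: pauses count toward the elapsed time.
    total_time = sum(end - start for _, start, end in segments)
    # Articulation rate: only time with active speech counts.
    active_time = sum(end - start for label, start, end in segments
                      if label != pause_label)
    if total_time <= 0 or active_time <= 0:
        raise ValueError("session must contain active speech")
    return n_phones / total_time, n_phones / active_time


# Made-up session: four phones and one 0.5 s pause within 1.0 s.
session = [("hh", 0.0, 0.1), ("ay", 0.1, 0.3), ("sil", 0.3, 0.8),
           ("t", 0.8, 0.9), ("uw", 0.9, 1.0)]
speaking_rate, articulation_rate = phone_rates(session)
print(speaking_rate, articulation_rate)
```

In this toy session the articulation rate is about twice the speaking rate, because half of the session time is pause.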
Phone rates were computed for each individual subject and session day using the database described in Section 3 (i.e., the in-clinic free-response speech in the collection by Mundt et al. [7]). Correlations between these global rate measures and the total HAMD score, along with its sub-topics (17 individual symptom sub-topics), were all computed. For the results of this study, Spearman correlation was chosen over Pearson because of the quantized ranking nature of the HAMD depression scores and the possible nonlinear relationship between score and speech feature [20,21]. Thus, the correlation results determine whether a monotonic relationship exists between extracted speech features and depression-rating scores.
All the significant^b correlations of phone rate with depression ratings are shown in Table 1. Examining the HAMD total score, we see that a significant correlation occurs between this total and the phone-based speaking rate. The articulation rate measure did not show the same correlation with HAMD total, but did show a stronger relationship with the Psychomotor Retardation rating than the more general speaking rate. The most significant correlations for both the speaking and articulation rate measures are found with the Psychomotor Retardation ratings. This finding is consistent with the fact that the HAMD Psychomotor Retardation sub-topic is a measure of motor slowing, including the slowing of speech (see Appendix).
Although the rate measurement methods adopted in this study are different, we observe certain consistencies in this study's findings with those of Mundt et al. [7]. In the Mundt et al. study, on the same database, speaking rate was measured in terms of syllables/second, based on the fixed number of syllables in the "Grandfather Passage". Mundt et al. found a Pearson correlation between HAMD total score and the speaking rate of -0.23 with high significance, consistent with our Spearman correlation of -0.22 for phone-based speaking rate. By computing the measures in this study from the free-response interview section of the recordings, instead of the read-passage recordings, we focus more on the changes in conversational speech and remove the variable of different reading styles used by the patients. In addition, the use of an automatic method allowed us to analyze much longer samples of speech, and thus obtain a more reliable estimate.
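To make the choice of Spearman correlation concrete, the sketch below computes it as the Pearson correlation of rank-transformed data, with tied values (common for quantized ratings) sharing the mean of their ranks. This is a generic textbook formulation written for illustration, not the code used in the study. The function names and the example numbers are assumptions, and significance testing (p-values) is omitted.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: phone rate falls as a quantized rating rises.
phone_rate = [12.1, 11.4, 10.9, 10.2, 9.8]
rating = [0, 0, 1, 2, 2]
rho = spearman(phone_rate, rating)
print(round(rho, 3))
```

Because only ranks enter the computation, any strictly monotonic relationship (linear or not) between a speech feature and a rating yields a coefficient of magnitude 1.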

Phone-specific measurements
Up to this point, we have examined global (i.e., average over all phones) measurements of rate across utterances. In this section, we decompose the speech signal into individual phones and study the phone-specific relationships with depression severity. With this approach, we find distinct relationships between phone-specific duration and the severity of certain symptoms, presenting a snapshot of how speech can differ with varying symptom severities. We use two different definitions of phone duration: (1) phone boundaries via an automatic phone recognizer, and (2) width of the energy spread around the centroid of a signal [22] within the defined phone boundaries. Decomposition into phone-specific measures allows for a more refined analysis of speech timing.
As in Section 4, owing to the quantized nature of the rankings, Spearman correlation is used to determine whether a monotonic relationship exists between extracted speech features and depression-rating scores.

Duration from phone recognition boundaries
Using an automatic phone recognition algorithm [19], we detect the individual phones and their durations. Before proceeding with vowel and consonant phones, we will first examine the silence or "pause" region within a free-response speech session.
Pause length: The automatic phone recognition algorithm categorizes pauses as distinct speech units, with lengths determined by estimated boundaries. Both average pause length and percent total pause time are examined in the correlation measures used in this study, and the results are summarized in Table 2.
We compute the correlations between the average pause length over a single speech session and the HAMD total and corresponding sub-topic ratings; the results are shown in Table 2. The average pause length is inversely related to the overall speaking rate, and so, as seen with the phone-based global speaking rate measures of Section 4, the HAMD Psychomotor Retardation score again shows the highest correlation value. The HAMD total score, along with a large number of subtopics, shows a significant worsening of condition with longer average pause length.
The ratio of pause time measure is defined as the percent of total pause time relative to the total time of the free-response speech session. This feature, in contrast to the average pause length measure, is more sensitive to a difference in the amount of time spent in a pause period, relative to the time in active speech. Thus, a change in time spent for thinking, deciding, or delaying further active speech would be captured by the ratio of pause time measure. For this ratio, a highly significant correlation was seen with only the HAMD total score. Most of the significant correlations with total and subtopic symptom scores seen with ratio of pause time were also correlated with average pause length; the only sub-topic that does not follow this rule is the HAMD measure of Early Morning Insomnia, which shows a higher pause ratio with worsening of condition.
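The two pause measures can be sketched in a few lines of Python. The (label, start, end) segment layout and the function name pause_measures are assumptions made for this illustration, and the segments are assumed to cover the whole session.

```python
from typing import List, Tuple

# Hypothetical segment layout: (label, start_sec, end_sec); "sil" marks a pause.
Segment = Tuple[str, float, float]


def pause_measures(segments: List[Segment], pause_label: str = "sil") -> Tuple[float, float]:
    """Return (average pause length in seconds, percent of session spent in pauses)."""
    pauses = [end - start for label, start, end in segments if label == pause_label]
    total_time = sum(end - start for _, start, end in segments)
    if not pauses or total_time <= 0:
        raise ValueError("session must contain at least one pause")
    average_pause_length = sum(pauses) / len(pauses)
    percent_pause_time = 100.0 * sum(pauses) / total_time
    return average_pause_length, percent_pause_time


# Made-up session: two pauses (0.4 s and 0.6 s) within 2.0 s.
session = [("sil", 0.0, 0.4), ("ay", 0.4, 0.6), ("sil", 0.6, 1.2),
           ("t", 1.2, 1.3), ("uw", 1.3, 2.0)]
avg_pause, pct_pause = pause_measures(session)
print(avg_pause, pct_pause)
```

The two measures can move independently: many short pauses can raise the pause ratio without raising the average pause length.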
As shown in Table 2, we again observe consistency with certain results from Mundt et al. [7], who obtained a Pearson correlation of 0.18 (p-value < 0.01) between percent pause time and the HAMD total score, in comparison to our Spearman correlation of 0.25 (p-value = 0.009) between ratio of pause time and the HAMD total score. Mundt et al. also examined a number of pause
Phone length: The duration of consonants and vowels, henceforth referred to as phone length (in contrast to pause length), varied in a non-uniform manner over the observed depression severities. Specifically, the severity of each symptom sub-topic score exhibited different corresponding phone length correlation patterns over all of our recognition-defined phones.
In order to test the correlation between specific phone characteristics and the sub-topic ratings of MDD, average length measures for each unique phone were extracted for each subject and session day. Significant correlations (i.e., correlations with p-value < 0.05) across phones are illustrated in Figure 3 for HAMD total and sub-topic ratings. We observe that the sign and magnitude of correlation vary for each symptom sub-topic, along with which of the specific phones show significance in their correlation value. A clear picture of the manner of speech (in terms of the phone duration) while certain symptoms are present can be inferred from Figure 3.
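The per-phone feature described above, one average length per unique phone for a given subject and session day, can be sketched as follows. The segment layout and the function name average_phone_lengths are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical segment layout: (label, start_sec, end_sec).
Segment = Tuple[str, float, float]


def average_phone_lengths(segments: List[Segment]) -> Dict[str, float]:
    """Map each unique label in one session to its mean duration in seconds."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for label, start, end in segments:
        totals[label] += end - start
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Made-up session with two /t/ tokens (0.1 s and 0.2 s) and one /aa/ token.
session = [("t", 0.0, 0.1), ("aa", 0.1, 0.3), ("t", 0.3, 0.5)]
lengths = average_phone_lengths(session)
print(lengths)
```

Each session then contributes one value per phone, and these values are the quantities correlated against the HAMD total and sub-topic ratings.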
The HAMD Psychomotor Retardation correlations stand out across a large set of phones, with positive individual correlations indicating a significant lengthening of these phones with higher Psychomotor Retardation rating. This is again consistent with the slowing of speech being an indicator of psychomotor retardation, but narrows down the phones which are affected to a small group, and reaches the high individual correlation of 0.47 with the average phone length of /t/. In contrast, there are also sub-topics that show groupings of phones that are significantly shortened with worsening of condition: for example, HAMD Insomnia Middle of the Night. Although there exist some overlaps in the unique phones that show significant correlations with ratings of condition, we see that none of the total or sub-topic correlation patterns contain exactly the same set of phones. Nonetheless, strong intercorrelations between the HAMD symptom sub-topics may be seen in the phone correlation patterns; for example, Psychomotor Retardation is most strongly correlated (negatively) with the Agitation subtopic (see Section 3); as a possible reflection of this, two phones that show a positive correlation with the Psychomotor Retardation sub-topic are negatively correlated with Agitation. We see that the total HAMD score shows relatively low or no significant correlation values with our individual phone length measures, and the few that do show some significance create a mixed pattern of shortening and lengthening of those phones. Since the total assessment score is composed by taking the sum over all sub-topics, and each sub-topic seems to have a distinct lengthening or shortening speech rate pattern related to it, the total score should only show correlations with phone lengths that have consistent positive or negative correlations across a number of sub-topics; we see that this is the case, especially with pause length (/sil/) and the phones /aa/ and /s/.
An important consideration is the correlation patterns of phones that are produced in a similar way, i.e., having the same manner of articulation. Figure 3 displays the phones in their corresponding groups; dashed vertical lines separate categories (vowel, fricative, plosive, approximant, and nasal). We examine each category individually as follows:
Pauses-We include pauses in Figure 3 for comparison. As already noted, longer average pause lengths are measured with worsening of condition for a number of sub-topics (see Table 2 for correlation values).
Vowels-/aa/ and /uh/ are the two vowels that show more than one significantly negative correlation with a sub-topic, indicating shortening of duration with worsening of condition. There are two groups of vowels that show a positive correlation with HAMD Psychomotor Retardation score: (1) the /aw/, /ae/, /ay/, /ao/, and /ow/ group, all of which also fall into the phonetic category of open or open-mid vowels; and (2) the /iy/, /ey/, /eh/ group, which also has correlations with the Weight loss sub-topic (in addition to the Psychomotor Retardation sub-topic), with this group falling into the phonetic category of close or close-mid vowels.
Fricatives-The fricative which has the most similar correlation pattern to any vowels is /v/, which is a voiced fricative. Consonants /s/ and /z/ both show lengthening (positive correlation) with worsening of Psychomotor Retardation; they are also both high-frequency fricatives. /s/ shows a consistent positive correlation pattern across a range of sub-topics; the correlation pattern for this fricative is most similar to the ones seen for pause length.
Plosives-With regard to Psychomotor Retardation, the three plosives which show significant positive correlations are /g/, /k/, and /t/, which are also all mid-to-high-frequency plosives; this group also shares similar correlations for the Mood sub-topic. A smaller effect is also observed: /t/, /p/, and /b/, all of which are diffuse (created at the front of the mouth, i.e., labial and front lingual) consonants, all show negative correlations with Middle of the Night Insomnia.
Approximants-Both /r/ and /w/ show a positive correlation with Psychomotor Retardation. The single significant correlation found for /l/ is with the Weight Loss sub-topic, which has no other correlation within the approximant group, but does show consistent correlations with respective subsets of the vowel (/ih/, /iy/, /ey/, /eh/) and fricative (/v/, /f/) groups.
Nasals-The nasal /m/ had no significant correlations with HAMD rating. The nasal /n/ has two significant correlations, but does not have similar correlation patterns to any other phone. The phone /ng/ has a correlation pattern most similar to /s/ and pauses.
We provide additional analysis of the correlation patterns across phones, with respect to the intercorrelations between HAMD sub-topics, in the conclusions of Section 7.
As an extension of the individual phone results, sub-topics with at least four significant individual phone correlations were identified, and corresponding phone durations were linearly combined into a measure. Positive or negative unit weights were chosen based on the sign of their individual phone correlation values. More formally, denote the average length of phone k by $L_k$ and suppose that a subset $P_i$ is the set of significantly correlated average phone lengths for HAMD sub-topic i. We then define a new variable $L_i$ as the sign-weighted sum

$$L_i = \sum_{k \in P_i} \alpha_k L_k$$

where the weighting coefficients $\alpha_k$ are ±1, defined by the sign of the relevant phone correlation. The full feature extraction process, from speech to the final linearly combined duration measure, is outlined in Figure 4.
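The sign-weighted combination can be illustrated with a short Python sketch. It selects phones whose correlation with a sub-topic is significant (p-value < 0.05, the threshold used for the individual phone results), assigns each a weight of +1 or -1 from the sign of its correlation, and sums the signed average lengths. The function names, the dictionary layout, and every number in the example are hypothetical.

```python
from typing import Dict, Tuple


def select_signed_phones(correlations: Dict[str, Tuple[float, float]],
                         alpha: float = 0.05) -> Dict[str, int]:
    """Keep phones with a significant correlation; map each to +1 or -1.

    `correlations` maps phone -> (rho, p_value) for one HAMD sub-topic.
    """
    return {phone: (1 if rho > 0 else -1)
            for phone, (rho, p_value) in correlations.items()
            if p_value < alpha}


def composite_length(avg_lengths: Dict[str, float], signs: Dict[str, int]) -> float:
    """Sign-weighted sum of average phone lengths over the selected phones."""
    return sum(sign * avg_lengths[phone] for phone, sign in signs.items())


# Hypothetical correlations for one sub-topic and one session's average lengths (s).
correlations = {"t": (0.47, 0.001), "s": (0.30, 0.02),
                "aa": (-0.25, 0.04), "m": (0.05, 0.60)}
avg_lengths = {"t": 0.08, "s": 0.11, "aa": 0.09, "m": 0.07}

signs = select_signed_phones(correlations)          # /m/ is dropped as non-significant
composite = composite_length(avg_lengths, signs)    # 0.08 + 0.11 - 0.09
print(signs, composite)
```

Because the weights are fixed at ±1, the composite grows when positively correlated phones lengthen and when negatively correlated phones shorten, so both effects push the feature in the same direction.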
Through this simple linear combination of a few phone-specific length features, we achieved much higher correlations than when examining average measures of the speech (i.e., globally), and, as before, the highest correlation is reached by the HAMD Psychomotor Retardation sub-topic.
The resulting correlation between the weighted sum of the individual phone lengths and the relevant score is shown in Table 3. The left-most column gives the set of phones used for each sub-topic (selected based on correlation significance). We observe that our largest correlations thus far are reached by our "optimally" selected composite phone lengths with each sub-topic. The largest correlation of the composite phone lengths is again reached by the HAMD Psychomotor Retardation measure with a value of 0.58, although the gain in correlation value from 0.47 (achieved with /t/) to 0.58 is small, considering the large number of phones that contribute to the composite feature (19 phone durations and pause/silence duration). In contrast, for the HAMD Work and Activities sub-topic, a correlation gain from 0.28 (/ih/) to 0.39 (/sil/, /aa/, /ih/, /ow/, /eh/, /s/) is achieved using only 6 phone lengths in the composite feature.
An alternative view of the correlation results of Table 3 is shown in Figure 5. In the figure, we display a comparison between the highest individual phone correlation and the composite length feature correlation values taken from Table 3. Significant correlations with global speaking rate (from Table 1) are included for comparison.

Phone-specific spread measurement
An alternative definition of phone duration was constructed using the concept of the spread of a signal's energy. A large subset of our phones consists of a single, continuous release of energy with tapered onsets and offsets; this is particularly the case for burst consonants (e.g., /p/, /b/) and for vowel onsets and offsets (see Figure 6, for example). In these cases, phone boundaries, as deduced from an automatic phone recognizer, may not provide an appropriate measure of phone duration. One measure of phone length or duration is given by the signal spread about the centroid of the envelope of a signal [22]. The centroid of a phone utterance, where the utterance is denoted e[n], is computed via a weighted sum of the signal. Specifically, the centroid for each phone utterance, $n_{\mathrm{centroid}}$, is given by

$$n_{\mathrm{centroid}} = \sum_{n=1}^{N} n\, e^{2}[n]$$

where the square of the signal is normalized to have unit energy, and N is the number of samples in each phone utterance. The standard deviation about $n_{\mathrm{centroid}}$ is used as the "spread" (i.e., alternate duration) feature.

Significant spread-based phone length correlations are illustrated in Figure 7 for both HAMD total and sub-topic ratings. We see again that HAMD Psychomotor Retardation stands out with a large set of significant positive correlations with phone duration, indicating longer durations with worsening of the condition. HAMD Insomnia Middle of the Night shows consistent shortening of phone duration with increasing severity ratings. This consistency with the recognition-based length results is a product of the strong correlation between our recognition and spread-based measures. We see that overall, there are more changes in the correlation results with burst consonants, such as /k/, /g/, and /p/, than with any other phones due to their burst-like, shorter nature in time. As seen in Figure 6, the phone recognition algorithm showed a tendency to overestimate (set too early) the onset phone boundary for these burst consonants; on the other hand, the duration of the silence gap before or after the burst may also be condition dependent.

Figure 5 Absolute Spearman correlation value between measure and HAMD score. The individual phone correlation bars correspond to the maximum absolute correlation between depression assessment score and a single phone-specific average length; the specific phone used is shown at each bar. The phone combination correlation bars show the absolute correlation value between assessment score and the signed aggregate phone length; the phones used for this aggregate length are listed in the first column of Table 3. Global speaking rate correlation values from Table 1 are included for comparison.
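The centroid and spread feature described in this section can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name, the 1-based sample index, and the use of the raw samples (rather than a separately estimated envelope) are assumptions of the sketch.

```python
import math

def centroid_and_spread(samples):
    """Return (centroid, spread), in samples, for one phone utterance e[n]."""
    energy = [s * s for s in samples]
    total = sum(energy)
    if total == 0:
        raise ValueError("segment has no energy")
    # Normalize the squared signal to unit energy so it acts as a weighting.
    weights = [e / total for e in energy]
    # Assumption: samples are indexed n = 1..N.
    centroid = sum(n * w for n, w in enumerate(weights, start=1))
    # Spread is the standard deviation of the energy about the centroid.
    variance = sum(((n - centroid) ** 2) * w
                   for n, w in enumerate(weights, start=1))
    return centroid, math.sqrt(variance)
```

Dividing the returned spread by the sampling rate would express it in seconds, which makes it comparable to the recognizer-based phone lengths.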

Effects of noise and sub-topic intercorrelation
One of the more general relationships that can be drawn from these data is that worsening of the psychomotor retardation condition can be observed in a subject's speech rate. A question we can then ask is: "Are the correlations between our speech measures and the other sub-topics the result of noise and/or sub-topic intercorrelation with the Psychomotor Retardation sub-topic?" To alleviate the effects of spurious correlations on our interpretation, in addition to showing only significant results, the results in Figures 3 and 7, as well as subsequent related figures, are presented such that phones are grouped according to manner of articulation and the sub-topics are grouped by significant absolute intercorrelation values. Clustering of significant correlations within a phonetic or intercorrelation sub-group suggests that these consistent correlations are indeed meaningful. For further applications, one needs to know which correlation results are the product of strong intercorrelation between each sub-topic and Psychomotor Retardation and which are not. To help address this issue, although it likely deserves a more in-depth analysis, an additional experiment was run in which the correlations between sub-topics and phone length were re-computed using only the speaker-session samples that had a Psychomotor Retardation score of 0 (i.e., no recorded psychomotor retardation). The results are shown in Figure 8, and we observe that, for sub-topics that are strongly correlated with Psychomotor Retardation, such as Agitation and Work-Activities (see Figure 2), the correlation patterns do change and most of the significant correlations found earlier are no longer present. For sub-topics that have a weak correlation to Psychomotor Retardation, such as Suicide or General Symptoms, we observe that many of the previous significant correlations found with phone length remain the same. In addition, we see that, for all correlations that are retained with this second analysis, there is no change in sign, further supporting the hypothesis that these correlations are not spurious or completely due to intercorrelations with Psychomotor Retardation.

Figure 6 Example of a single utterance of the burst consonant /t/ where the boundaries detected by the automatic phone recognizer are greater than the phone duration corresponding to energy spread. Asterisk and cross markers show our estimated centroid and spread boundaries for this phone.
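The re-computation restricted to samples with a Psychomotor Retardation score of 0 can be sketched as below. The dictionary keys (`psychomotor_retardation`, `suicide`, `len_s`) and the sample layout are hypothetical, and the significance test applied in the study is not reproduced here; the sketch only shows the filtering step followed by a Spearman rank correlation with average ranks for ties.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_without_retardation(samples, subtopic, feature):
    """Correlate a sub-topic rating with a speech feature, keeping only
    speaker-session samples whose Psychomotor Retardation score is 0."""
    kept = [s for s in samples if s["psychomotor_retardation"] == 0]
    return spearman([s[subtopic] for s in kept], [s[feature] for s in kept])
```

Comparing the sign and significance of this restricted correlation with the full-sample correlation is one way to separate relationships that survive the removal of Psychomotor Retardation from those that do not.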

Phone recognition accuracy
As mentioned earlier, the phone recognition algorithm is based on a Hidden Markov Model approach, which for English was reported as having about an 80% overall accuracy [19]. Although this implies some mislabeling of phones, the mislabeling is often between similarly structured (i.e., similar in time and frequency) phones. The primary effect of labeling errors is a form of added "noise" to our correlation studies and the feature vectors in Sections 5 and 6. In spite of this noise presence, we found strong correlations with phone-specific length features, with these feature results being supported by the preliminary classification work of Section 6. Nevertheless, a more quantitative study of the effect of phone mislabeling is warranted.

Classifiers of MDD: preliminary results
The correlation results obtained in this study motivate the development of automatic classifiers of depression severity based on phone-specific measures of speech rate. Feedback from a reliable classifier would be a highly beneficial tool for clinicians. Reliable classifiers could even be used as a tool to aid in the standardization of depression ratings. As an initial step to realize this aim, we provide a proof-of-concept use of speech rate features, specifically, the set of recognition-derived, phone-specific lengths, for classification. A more exhaustive classification study requires a larger, more comprehensive database and investigation of the broader suite of speech-rate features, such as the phone length from energy spread or signal power; we shall address this in our ongoing study, Section 7.
In forming depression classifiers, we consider the 5-class problem for the HAMD total score; the 5-class case is divided into the ranges 0-5, 6-10, 11-15, 16-20, and 21-27. The 5-class experiment serves as a test of classification accuracy. For the symptom sub-topics, we implemented the 3-, 4-, or 5-class problem for each sub-topic based on the maximum possible range for each; for example, the HAMD Mood sub-topic has the possible scores of 0, 1, 2, 3, or 4, and thus, we implemented a 5-class problem for this sub-topic. For all the classifiers considered, we tested using a leave-one-out cross-validation scheme, as illustrated schematically for a 2-class case in Figure 9.
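As a small illustration of the class definitions and the cross-validation scheme, the sketch below maps a HAMD total score to one of the five ranges listed above and enumerates leave-one-out splits. The function and constant names are hypothetical and are not taken from the study.

```python
# Score ranges for the five HAMD-total classes, as listed in the text.
HAMD_TOTAL_RANGES = [(0, 5), (6, 10), (11, 15), (16, 20), (21, 27)]

def hamd_total_class(score):
    """Map a HAMD total score to a class index 0..4."""
    for index, (low, high) in enumerate(HAMD_TOTAL_RANGES):
        if low <= score <= high:
            return index
    raise ValueError(f"score {score} is outside the ranges used here")

def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index), holding out each sample once."""
    for test_index in range(n_samples):
        train = [i for i in range(n_samples) if i != test_index]
        yield train, test_index
```

For the sub-topics, the rating itself can serve directly as the class label, since each possible score forms its own class.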
We use a simple Gaussian maximum-likelihood algorithm for all the experiments; i.e., each class is modeled as a multi-dimensional Gaussian, with the number of dimensions matching the feature vector dimension, and classification is then performed by finding the class of maximum likelihood for the test sample [23]. Our phonological feature vector is composed of our recognition-derived average phone (vowel and consonant) lengths (see Section 5.1) and the average pause (silence) length values. We consider four different feature selection methods: (1) a single feature, the signed aggregate of individual phone lengths and pause length (Signed Agg; see Table 3, column 1, for a selection of the phones used); (2) no feature selection, i.e., all the individual average phone lengths and/or the pause length are used as a vector of features (None); (3) hand-selection of the subset of individual phone lengths and/or pause length that show significant correlation statistics to form a feature vector (Stat Sig); and (4) automatic selection of a subset of individual phone lengths and/or the pause length to minimize error, though an optimal solution is not guaranteed (min error) [23].
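A Gaussian maximum-likelihood classifier with leave-one-out testing can be sketched as follows. To keep the example dependency-free, it assumes a diagonal covariance (independent feature dimensions) and applies a small variance floor; both are simplifications introduced here and are not stated in the text, which describes a general multi-dimensional Gaussian per class. All names are hypothetical.

```python
import math
from collections import defaultdict

VAR_FLOOR = 1e-6  # assumption: guards against zero variance in tiny classes

def fit_gaussians(features, labels):
    """Estimate a per-class mean and variance for each feature dimension."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    models = {}
    for y, rows in grouped.items():
        dim = len(rows[0])
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        variances = [
            max(sum((r[d] - means[d]) ** 2 for r in rows) / len(rows), VAR_FLOOR)
            for d in range(dim)
        ]
        models[y] = (means, variances)
    return models

def log_likelihood(x, model):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )

def classify(x, models):
    """Return the class whose Gaussian gives x the highest likelihood."""
    return max(models, key=lambda y: log_likelihood(x, models[y]))

def leave_one_out_predictions(features, labels):
    """Train on all samples but one, classify the held-out one, repeat."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(classify(features[i], fit_gaussians(train_x, train_y)))
    return predictions
```

With a one-dimensional feature vector, this corresponds to the single-feature (Signed Agg) case; longer vectors correspond to the other selection methods, subject to the diagonal-covariance simplification.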
Providing classification results on the symptom sub-topics would add an additional level of feedback to a clinician. In addition, considering each rating level as a class takes into account the fact that variations on a single-point scale could indicate large changes in an individual's condition. We therefore examine each sub-topic as a 3-, 4-, or 5-class problem, with the number of classes matching the range of possible scores for each particular sub-topic. We also divide the total scores into a 5-class problem to test the classifier's ability to differentiate among in remission, mild, moderate, severe, or very severe depression. We found that most of the classification errors come from misclassification into an adjacent severity level; for example, a severity rating of 1 for a given sub-topic might be misclassified as a 0 or a 2. These results are summarized in Figure 10, which shows the (average-adjusted; see endnote c) RMSE for each individual assessment rating. The RMSE provides a sense as to how far the classifier diverges from the clinician rating; all of the RMSEs fall below 2, quantifying our observation that most misclassifications fall into an adjacent severity level. In almost all cases, we benefit from some form of feature reduction; features that were hand-selected from the correlation results overlap with, but do not exactly match, the features that are chosen by the algorithm to minimize error. Finally, the RMSEs indicate the predictive potential of our phonologically-based feature sets, including the single feature of linearly combined duration.
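The two summary quantities discussed here, the RMSE between predicted and clinician-assigned classes and the share of errors that land in an adjacent severity level, can be sketched as below. This computes a plain RMSE; the average adjustment mentioned in the text is not reproduced because its definition is not given in this section. Function names are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and clinician class labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def adjacent_error_fraction(predicted, actual):
    """Fraction of misclassifications that are off by exactly one level."""
    errors = [abs(p - a) for p, a in zip(predicted, actual) if p != a]
    if not errors:
        return 0.0
    return sum(1 for e in errors if e == 1) / len(errors)
```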
As we are using only a subset of our speech-rate features, the recognition-derived average phone lengths and the average pause length, one could potentially improve performance by extending the feature space beyond what is used in this preliminary study. Specifically, we have not used signal power and spread-based features, not to

Figure 9 Illustration of the leave-one-out cross-validation approach for the 2-class problem, depicted as green versus blue. Each unique subject-session pair in our dataset is an "observed sample" that is described by its feature vector. For cross validation, we take one sample out, train the classifier on the remaining samples, classify the excluded sample and record the performance. The process is repeated until all of the observed samples have been tested.
recognition-based phone boundary definitions. The phone- and symptom-specific correlation patterns present a visual interpretation of how speech can change with different symptom severities. Possibly, speech sounds with either similar production categories or similar usages in speech (e.g., at the onset or at the ending of a word) would show correspondingly similar changes with MDD condition severity; we explored the former by grouping the phones by manner of articulation and finding consistencies in the correlations within the groups. A further experiment, in which the correlations between sub-topics and phone length were re-computed using only the speaker-session samples that had a Psychomotor Retardation score of 0, indicated that not all meaningful sub-topic correlations are tied to Psychomotor Retardation. The additional correlation study with the linearly combined phone duration measure shows how using only a subset of phones can reveal a stronger underlying relationship.
Our correlation results show a snapshot of how speech can vary across each individual symptom severity. Another possibility that we considered is that sub-topics with similar or correlated symptoms would show similarities in the shift in speech rate and phone-specific duration measures. The similarities between symptom sub-topics are quantified by the intercorrelations shown in Figure 2. As an illustrative analysis, we examined the Psychomotor Retardation sub-topic, which is most strongly correlated with Agitation (negatively, -0.40) and Mood (positively, 0.36). Keeping this in mind, we see in Figure 3 oppositely signed significant correlations for both Psychomotor Retardation and Agitation for two phones (/aw/ and /t/); we also see positive significant correlations for both Psychomotor Retardation and Mood for the same five phones and the pause measure (/sil/, /s/, /g/, /k/, /t/, and /ng/). The strongest HAMD intercorrelation for our dataset falls at 0.64 and corresponds to the correlation between the Mood and Work-Activities sub-topics. Although the correlation patterns for these phonologically-based measures share some characteristics, they are not the same, indicating that the two sub-topics are somewhat distinct.
We have also introduced a preliminary study for classification of depression severity based on our speech-rate features using phone length derived from phone-recognition boundaries. Using a simple Gaussian-likelihood classifier, we showed the results for the 3-, 4-, or 5-class classification problem for all HAMD score categories, with each class representing a different severity level. Our preliminary classification results show promise as a beneficial tool to the clinician, both as an initial measure of depression level and in assessing severity of symptoms, and motivate the extension of the study to further phone-based features.
Depression does not follow the same symptom progression in all patients and should not be treated as though it does. Our correlation and classification results with the HAMD MDD assessment reveal changes that occur in speech rate with different symptom severities. Some symptoms, such as Psychomotor Retardation, have a consistent relationship with a change in speech pattern, while others, such as short-term changes in Weight, may not. Identifying reliable biomarkers for each symptom is useful, since each symptom category and its progression to different severities is more homogeneous across patients than the overall depression rating, which can encompass completely different manifestations of the disorder.
In this article, we found significant correlations between a subset of the HAMD symptom sub-topic ratings and our vocal features, with supporting classification results. We found that a symptom-specific approach offers a more informative profile of a subject's state and is more likely to result in consistent shifts in speech pattern or behavior. For the total HAMD score, however, the case-by-case variability with which different sub-topics increase in severity as the MDD condition worsens, together with the sub-topic-specific relationships that we see with speech measures, suggests that one might not be able to expect a high HAMD total score to coexist with a reliable shift in a particular speech pattern. Each symptom sub-topic, when examined individually across its entire severity range, has unique and sometimes opposing shifts in speech rate measures.

Ongoing study
Based on the success of phone-specific speech rate measures in correlating with certain MDD symptoms, we plan to extend our experiments to other phone-specific speech measures, thus exploiting the general phonological framework that we have developed. Our ongoing studies include phone-specific energy measures, an examination of vowel usage in depression, measures involving prosodic rhythm and modulation [24], and the use of derivatives of these measures. The derivative of a vocal feature allows one to track how the changes in an individual's speech pattern may match similarly scaled changes in their condition. Use of derivatives also serves as a way to normalize out absolute levels in a subject's baseline speech.
As a taste of our ongoing study, we cover a series of phone-based measures that extend the present results. We first discuss an alternative speech unit for computing speech rate, the pseudo-syllable rate. Individual phones are combined such that each vowel forms the nucleus of its own segment, with all of the preceding consonants grouped with it. Thus, a measure of pseudo-syllable rate will be highly correlated with the phone rate results. The motivation for this unit is its relation to syllables and the difficulty in automatically extracting syllables [25]. The speaking and articulation rates, as defined in Section 4, were calculated with respect to the pseudo-syllable rate, and correlations with HAMD scores were computed. Similar to the phone rate results, the pseudo-syllable speaking rate shows significant correlation with the HAMD Psychomotor Retardation (-0.37) and total (-0.26) scores, and the pseudo-syllable articulation rate shows highly significant correlation with the Psychomotor Retardation rating (-0.41).
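The pseudo-syllable grouping can be sketched as follows, reading the description as "each vowel closes a segment that also contains the consonants immediately before it". The ARPAbet-style vowel set, the `sil` pause label, the decision to skip pauses, and the handling of trailing consonants are all assumptions of this sketch; the recognizer's actual label set is not specified in this section.

```python
# Assumption: ARPAbet-style vowel labels and "sil" for pauses.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}
SILENCE = "sil"

def pseudo_syllables(phones):
    """Group a phone label sequence into pseudo-syllables.

    Returns (segments, leftover): each segment ends in a vowel and holds the
    consonants before it; leftover holds trailing consonants with no vowel.
    """
    segments, pending = [], []
    for phone in phones:
        if phone == SILENCE:
            continue  # pauses are skipped, not assigned to any segment
        pending.append(phone)
        if phone in VOWELS:
            segments.append(pending)
            pending = []
    return segments, pending

def pseudo_syllable_rate(phones, duration_seconds):
    """Pseudo-syllables per second over the supplied duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    segments, _ = pseudo_syllables(phones)
    return len(segments) / duration_seconds
```

Because every segment contains exactly one vowel, the count of pseudo-syllables equals the count of vowels, which is consistent with the expectation that this rate tracks the phone rate closely.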
Continuing our phone-based measures, we show in Figure 11 a correlation plot for individual phone average power. Phone power is computed as the sum of the squared signal over time. We see that the significant correlations with phone power are more uniform across phones within a sub-topic. Correlations with Psychomotor Retardation are negative for all phones and limited to mostly the vowel, approximant, and nasal phone categories.
In Figure 12, we show a plot comparing the individual phone length correlations of Figure 3 to the corresponding derivatives of the phone lengths. A rough derivative of the vocal features was computed by measuring the relative change between feature values on consecutive session days for each subject. The corresponding derivative of the depression ratings was computed in the same way. Comparing the derivative results with the base-value phone-specific correlations, there are no inconsistencies in the direction of length change with severity of condition; in other words, for all overlapping significant correlations, no positive correlation in one study is negative in the other.
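The rough derivative can be sketched as below. The sketch interprets "relative change" as the difference from the previous session, which is an assumption: a ratio to the previous value would be undefined for ratings of 0, and the exact definition used in the study is not given here. The data layout and function names are hypothetical.

```python
def session_changes(sessions):
    """sessions: iterable of (day, value) pairs for one subject.

    Returns the change in value between each pair of consecutive session days.
    """
    ordered = sorted(sessions, key=lambda pair: pair[0])
    return [later[1] - earlier[1] for earlier, later in zip(ordered, ordered[1:])]

def paired_changes(feature_sessions, rating_sessions):
    """Pair feature and rating changes for one subject.

    Assumes both lists cover the same session days.
    """
    feature_delta = session_changes(feature_sessions)
    rating_delta = session_changes(rating_sessions)
    if len(feature_delta) != len(rating_delta):
        raise ValueError("feature and rating sessions do not line up")
    return list(zip(feature_delta, rating_delta))
```

Pooling the paired changes across subjects gives the two sequences on which a rank correlation could then be computed, in the same manner as for the base values.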
In this article, we have only touched on classification-algorithm development, illustrating the predictive potential of our phonologically-based features, including the single feature of a simple linearly combined phone duration. We plan to extend this preliminary study using both more sophisticated classification schemes, such as SVMs, and a more comprehensive set of speech features, such as variations of our speech-rate measures, power, fundamental frequency measures, and temporal- and frequency-based rhythmic/modulation patterns. Along these lines, we will draw on prosodic tokenization approaches applied in other contexts [24,25].
We also touched on the issue of automatic phone recognition errors that can affect the accuracy of our speech-rate measures (see Section 5.4). We plan to further investigate the effect of these errors on our correlation and classification results. For example, the current phone recognizer [19] might be improved by invoking utterance transcriptions. Finally, we plan to explore the complementary use of other joint modalities, such as video tracking of facial features (e.g., visemes), that can yield biomarkers for certain symptoms or mental conditions that do not necessarily show in speech patterns.
More generally, we suspect that for other types of vocal features besides speech rate, the phone-specific approach, along with an individual MDD symptom analysis, will result in a more accurate representation of how speech can vary with different progressions of MDD.

Abbreviations
DALYs: disability-adjusted life years; MDD: major depressive disorder; MDE: major depressive episodes; RMSE: root mean squared error; SVM: support vector machine; YLD: years lost due to disability.