Aspects | Parameters and experimental setup |
---|---|
Sampling frequency | 16000 Hz |
Window type | Hamming |
Frame length | 16 ms |
Frame shift | 8 ms |
Pre-emphasis factor | 0.96 |
Databases | TIMIT, SITW, and NIST 2008 |
Number of speakers | 120 speakers for each database, total 360 speakers for all databases |
Total speech utterances used | 1200 for each database, total 3600 for all databases |
Language | English |
Data source(s) | Microphone speech for TIMIT and NIST 2008; hand-annotated speech from open-source media for SITW |
No. of samples per speaker | 10 for TIMIT; 10 likewise created for each of SITW and NIST 2008 |
Testing samples for each database | Total 480 utterances |
Training samples for each database | Total 720 utterances |
Dialect region | DR1 and DR4 selected from TIMIT to mirror our previous studies (49 speakers from DR1 and 71 from DR4) |
Average sample duration | 8 s for each speech utterance in both training and testing; all speech samples have a fixed length, with concatenation applied where necessary |
Features | MFCC and PNCC |
Feature vector dimension | 16 |
Feature normalization | Feature warping (FW) and cepstral mean and variance normalization (CMVN) |
Modeling | GMM-UBM |
Classifier | LLR |
Gaussian mixture components (GMC) | {8, 16, 32, 64, 128, 256, 512} |
Fusion types | Late fusion: mean, linear weights, maximum |
System environment | Clean; AWGN with G.712-type handset at 16 kHz; NSN (street traffic, bus interior, and crowd talking) with handset |
SNR levels in dB | {0, 5, 10, 15, 20, 25, 30} |
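
The front-end values in the table (16000 Hz sampling, 16 ms frames, 8 ms shift, pre-emphasis factor 0.96, Hamming window) translate directly into sample counts. The following is a minimal sketch of that arithmetic in plain Python, not the implementation used in the experiments. The function names, the symmetric Hamming formula, and the choice to drop a trailing partial frame are all assumptions made for illustration; the source does not specify them.

```python
import math

# Values taken from the table above.
SAMPLE_RATE_HZ = 16000
FRAME_LENGTH_MS = 16
FRAME_SHIFT_MS = 8
PRE_EMPHASIS = 0.96

# Derived sizes in samples.
FRAME_LENGTH = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000  # 256 samples
FRAME_SHIFT = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000    # 128 samples


def pre_emphasize(signal, factor=PRE_EMPHASIS):
    """Apply y[n] = x[n] - factor * x[n-1]; the first sample is kept unchanged."""
    if not signal:
        return []
    rest = [signal[n] - factor * signal[n - 1] for n in range(1, len(signal))]
    return [signal[0]] + rest


def hamming(length):
    """Symmetric Hamming window (assumed form): 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_count(num_samples, frame_length=FRAME_LENGTH, frame_shift=FRAME_SHIFT):
    """Number of full frames; a trailing partial frame is dropped (assumption)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


def windowed_frames(signal):
    """Pre-emphasize, split into overlapping frames, and apply the window."""
    emphasized = pre_emphasize(signal)
    window = hamming(FRAME_LENGTH)
    frames = []
    for i in range(frame_count(len(emphasized))):
        start = i * FRAME_SHIFT
        segment = emphasized[start:start + FRAME_LENGTH]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames
```

Under these assumptions, `frame_count(8 * SAMPLE_RATE_HZ)` gives 999 frames for an 8 s utterance, each 256 samples long with 50% overlap between neighbouring frames.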