Table 1 Parameters and setup used in all experiments and simulations

From: Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects

Aspects

Parameters and experimental setup

Sampling frequency

16000 Hz

Window type

Hamming

Frame length

16 ms

Frame shift

8 ms

Pre-emphasis factor

0.96

Databases

TIMIT, SITW, and NIST 2008

Number of speakers

120 speakers for each database, total 360 speakers for all databases

Total speech utterances used

1200 for each database, total 3600 for all databases

Language

English

Data source(s)

Microphone speech for TIMIT and NIST 2008; hand-annotated speech from open-source media for SITW

No. of samples per speaker

10 for TIMIT; 10 created as well for each of SITW and NIST 2008

Testing samples for each database

Total 480 utterances

Training samples for each database

Total 720 utterances

Dialect region

DR1 and DR4 were selected from TIMIT to mirror our previous studies (49 speakers from DR1 and 71 from DR4)

Average sample duration

8 s for each speech utterance in both training and testing; all speech samples have a fixed length, with concatenation applied where necessary

Features

MFCC and PNCC

Feature vector dimension

16

Feature normalization

Feature warping (FW) and cepstral mean and variance normalization (CMVN)

Modeling

GMM-UBM

Classifier

LLR

GMC (mixtures)

{8, 16, 32, 64, 128, 256, 512 }

Fusion types

Late fusion: mean, linear weights, and maximum

System environment

Clean; AWGN with G.712-type handset at 16 kHz; and street-traffic, bus-interior, and crowd-talking NSN with handset

SNR levels in dB

{0, 5, 10, 15, 20, 25, 30}

  1. The colored data represent the three databases and the highest SIA for each database: red for TIMIT, blue for SITW, and violet for NIST 2008
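
The framing parameters in the table imply concrete sample counts, which the following minimal Python sketch works out. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the pre-emphasis filter is assumed to be the common first-order form y[n] = x[n] - 0.96 x[n-1], trailing partial frames are assumed to be dropped, and the per-speaker train/test counts assume an even split across the 120 speakers.

```python
import math

# Values taken from the table.
SAMPLE_RATE_HZ = 16000
FRAME_LENGTH_MS = 16
FRAME_SHIFT_MS = 8
PRE_EMPHASIS = 0.96
UTTERANCE_SECONDS = 8


def frame_sizes(sample_rate_hz, length_ms, shift_ms):
    """Convert frame length and frame shift from milliseconds to samples."""
    return (sample_rate_hz * length_ms // 1000,
            sample_rate_hz * shift_ms // 1000)


def num_frames(num_samples, frame_len, frame_shift):
    """Count full frames; a trailing partial frame is dropped (assumption)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift


def pre_emphasize(signal, alpha):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (assumed form)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]


frame_len, frame_shift = frame_sizes(SAMPLE_RATE_HZ, FRAME_LENGTH_MS,
                                     FRAME_SHIFT_MS)
total_samples = SAMPLE_RATE_HZ * UTTERANCE_SECONDS
frames = num_frames(total_samples, frame_len, frame_shift)

# Per-database split from the table: 120 speakers, 720 training and
# 480 testing utterances. An even split per speaker is an assumption.
train_per_speaker = 720 // 120
test_per_speaker = 480 // 120

print(frame_len, frame_shift, frames)        # 256 128 999
print(train_per_speaker, test_per_speaker)   # 6 4
```

With these settings each frame holds 256 samples with 50% overlap, so an 8 s utterance yields 999 full frames before feature extraction.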