Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects

EURASIP Journal on Advances in Signal Processing

Table 1 Parameters and setup used in all experiments and simulations

Aspects	Parameters and experimental setup
Sampling frequency	16000 Hz
Window type	Hamming
Frame length	16 ms
Frame shift	8 ms
Pre-emphasis factor	0.96
Databases	TIMIT, SITW, and NIST 2008
Number of speakers	120 speakers for each database, total 360 speakers for all databases
Total speech utterances used	1200 for each database, total 3600 for all databases
Language	English
Data source(s)	Microphone speech for TIMIT and NIST 2008;
	hand-annotated speech from open-source media for SITW
No. of samples per speaker	10 for TIMIT; 10 were likewise created
	for each of SITW and NIST 2008
Testing samples for each database	Total 480 utterances
Training samples for each database	Total 720 utterances
Dialect region	DR1 and DR4 were selected from TIMIT to mirror our previous studies:
	49 speakers from DR1 and 71 from DR4
Average sample duration	8 s for each speech utterance in both training and testing;
	all speech samples were taken at a fixed length,
	with concatenation applied where necessary
Features	MFCC and PNCC
Feature vector dimension	16
Feature normalization	Feature warping (FW) and
	cepstral mean and variance normalization (CMVN)
Modeling	GMM-UBM
Classifier	Log-likelihood ratio (LLR)
GMC (mixtures)	{8, 16, 32, 64, 128, 256, 512}
Fusion types	Late fusion:
	Mean, linear weights, maximum
System environment	Clean; AWGN with G.712-type handset at 16 kHz; and
	non-stationary noise (NSN: street traffic, bus interior, and crowd talking) with handset
SNR levels in dB	{0, 5, 10, 15, 20, 25, 30}
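
The front-end settings in Table 1 (16 kHz sampling, 16 ms Hamming frames, 8 ms shift, pre-emphasis factor 0.96) can be illustrated with a minimal sketch. The table does not give the exact pre-emphasis filter form or the Hamming window variant, so the first-order filter y[n] = x[n] - 0.96 x[n-1] and the symmetric Hamming window used here are assumptions, and all function names are hypothetical.

```python
import math

# Values taken from Table 1.
SAMPLE_RATE_HZ = 16000
FRAME_LENGTH_MS = 16
FRAME_SHIFT_MS = 8
PRE_EMPHASIS = 0.96


def ms_to_samples(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return duration_ms * sample_rate_hz // 1000


def pre_emphasize(signal, alpha=PRE_EMPHASIS):
    """Assumed first-order filter: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(length):
    """Symmetric Hamming window (assumed variant)."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frame_signal(signal, frame_len, frame_shift):
    """Split a signal into overlapping, Hamming-windowed frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if len(signal) < frame_len:
        return []
    window = hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    return [
        [signal[i * frame_shift + k] * window[k] for k in range(frame_len)]
        for i in range(n_frames)
    ]


frame_len = ms_to_samples(FRAME_LENGTH_MS)    # 256 samples
frame_shift = ms_to_samples(FRAME_SHIFT_MS)   # 128 samples

# One second of a 440 Hz test tone (illustrative input, not from the paper).
tone = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE_HZ)
        for n in range(SAMPLE_RATE_HZ)]
frames = frame_signal(pre_emphasize(tone), frame_len, frame_shift)
print(len(frames), len(frames[0]))  # 124 256
```

With these settings, consecutive frames overlap by 50%, and an 8 s utterance (128000 samples) would yield 999 frames under the same framing rule.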

The colored data reflect the three databases and the highest SIA for each database: red for TIMIT, blue for SITW, and violet for NIST 2008
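
The three late-fusion rules listed in Table 1 (mean, linear weights, maximum) can be sketched as score-level combinations of two subsystems, for example one built on MFCC and one on PNCC features. The table does not state the fusion weights or any score normalization applied before fusion, so the weight, the example scores, and all function names below are hypothetical and only illustrate how the rules differ.

```python
def fuse_mean(scores_a, scores_b):
    """Mean fusion: average the two subsystems' scores per speaker."""
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]


def fuse_linear(scores_a, scores_b, weight_a):
    """Linear weighted fusion: weight_a * a + (1 - weight_a) * b."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


def fuse_max(scores_a, scores_b):
    """Maximum fusion: keep the larger score per speaker."""
    return [max(a, b) for a, b in zip(scores_a, scores_b)]


def identify(scores):
    """Return the index of the speaker with the highest fused score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical LLR scores for three enrolled speakers (not from the paper).
mfcc_scores = [1.0, 3.0, 2.0]   # MFCC subsystem prefers speaker 1
pncc_scores = [3.5, 2.5, 0.5]   # PNCC subsystem prefers speaker 0

print(identify(fuse_mean(mfcc_scores, pncc_scores)))          # 1
print(identify(fuse_linear(mfcc_scores, pncc_scores, 0.75)))  # 1
print(identify(fuse_max(mfcc_scores, pncc_scores)))           # 0
```

When the subsystems disagree, as in this toy example, the rules can select different speakers, which is why the choice of fusion rule matters for the reported accuracy.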