Text-independent speaker recognition based on adaptive course learning loss and deep residual network

EURASIP Journal on Advances in Signal Processing

Table 3 VoxCeleb1 and VoxCeleb2 dataset

Layer name	Kernel size	Strides	Output size
Conv1	7×7,32	1×1	(32,64,200)
Res1	\(\left[\begin{array}{c} 3\times 3,32\\ 3\times 3,32 \end{array}\right]\times 3\)	1×1	(32,64,200)
Conv2	1×1,64	2×2	(64,32,100)
Res2	\(\left[\begin{array}{c} 3\times 3,64\\ 3\times 3,64 \end{array}\right]\times 4\)	1×1	(64,32,100)
Conv3	1×1,128	2×2	(128,16,50)
Res3	\(\left[\begin{array}{c} 3\times 3,128\\ 3\times 3,128 \end{array}\right]\times 6\)	1×1	(128,16,50)
Conv4	1×1,256	2×2	(256,8,25)
Res4	\(\left[\begin{array}{c} 3\times 3,256\\ 3\times 3,256 \end{array}\right]\times 3\)	1×1	(256,8,25)
Reshape	-	-	(2048,25)
CASP	1×512	1×1	(512)
FC	-	-	(512)
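
The output sizes in the table can be checked with a short shape trace. The sketch below is illustrative only and makes two assumptions that the table does not state: the input is a single 64×200 feature map (inferred from the Conv1 output at stride 1×1), and every layer pads so that only the stride changes the spatial size. Under those assumptions a stride-1 stage keeps the size of its input, each stride-2 convolution halves both spatial dimensions, and the Reshape row folds channels and height together (256 × 8 = 2048). The names `LAYERS` and `trace_shapes` are hypothetical helpers, not part of the original work.

```python
# Layer specs transcribed from the table: (name, output channels, stride).
# Assumption: "same"-style padding, so only the stride alters height and width.
LAYERS = [
    ("Conv1", 32, 1),
    ("Res1", 32, 1),
    ("Conv2", 64, 2),
    ("Res2", 64, 1),
    ("Conv3", 128, 2),
    ("Res3", 128, 1),
    ("Conv4", 256, 2),
    ("Res4", 256, 1),
]


def trace_shapes(height=64, width=200):
    """Return {layer name: output shape} for an assumed height x width input."""
    shapes = {}
    for name, channels, stride in LAYERS:
        # Ceiling division models a strided layer with "same" padding.
        height = -(-height // stride)
        width = -(-width // stride)
        shapes[name] = (channels, height, width)
    # Reshape merges the channel and height axes and keeps the width axis.
    channels, height, width = shapes["Res4"]
    shapes["Reshape"] = (channels * height, width)
    return shapes


if __name__ == "__main__":
    for layer, shape in trace_shapes().items():
        print(f"{layer}\t{shape}")
```

Calling `trace_shapes()` with a different `width` shows how the second axis of the Reshape output scales with the input width, while the 2048 feature dimension stays fixed.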