Table 3 VoxCeleb1 and VoxCeleb2 dataset

From: Text-independent speaker recognition based on adaptive course learning loss and deep residual network

| Layer name | Kernel size | Strides | Output size |
| --- | --- | --- | --- |
| Conv1 | 7×7, 32 | 1×1 | (32, 64, 200) |
| Res1 | \(\left[\begin{array}{l} 3\times 3,\ 32\\ 3\times 3,\ 32\end{array}\right]\times 3\) | 1×1 | (32, 64, 100) |
| Conv2 | 1×1, 64 | 2×2 | (64, 32, 100) |
| Res2 | \(\left[\begin{array}{l} 3\times 3,\ 64\\ 3\times 3,\ 64\end{array}\right]\times 4\) | 1×1 | (64, 32, 100) |
| Conv3 | 1×1, 128 | 2×2 | (128, 16, 50) |
| Res3 | \(\left[\begin{array}{l} 3\times 3,\ 128\\ 3\times 3,\ 128\end{array}\right]\times 6\) | 1×1 | (128, 16, 50) |
| Conv4 | 1×1, 256 | 2×2 | (256, 8, 25) |
| Res4 | \(\left[\begin{array}{l} 3\times 3,\ 256\\ 3\times 3,\ 256\end{array}\right]\times 3\) | 1×1 | (256, 8, 25) |
| Reshape | - | - | (2048, 25) |
| CASP | 1×512 | 1×1 | (512) |
| FC | - | - | (512) |
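
The output sizes in the table can be traced with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a 64×200 input feature map (inferred from the Conv1 row), "same"-style padding so that only the stride changes the two spatial axes, and a reshape that merges the channel axis with the first spatial axis. The names `LAYERS`, `trace_shapes`, and `reshape_for_pooling` are hypothetical.

```python
import math

# (layer name, output channels, stride) copied from the table rows.
LAYERS = [
    ("Conv1", 32, 1),
    ("Res1", 32, 1),
    ("Conv2", 64, 2),
    ("Res2", 64, 1),
    ("Conv3", 128, 2),
    ("Res3", 128, 1),
    ("Conv4", 256, 2),
    ("Res4", 256, 1),
]


def trace_shapes(height, width, layers=LAYERS):
    """Return [(name, (channels, height, width)), ...] for each layer.

    Assumption: padding keeps the spatial size unless stride > 1, in which
    case each spatial axis is divided by the stride (rounded up).
    """
    shapes = []
    for name, channels, stride in layers:
        height = math.ceil(height / stride)
        width = math.ceil(width / stride)
        shapes.append((name, (channels, height, width)))
    return shapes


def reshape_for_pooling(shape):
    """Merge channels with the first spatial axis: (C, H, W) -> (C * H, W)."""
    channels, height, width = shape
    return (channels * height, width)


# Assumed input: a 64 x 200 feature map (not stated in the table itself).
shapes = trace_shapes(64, 200)
for name, shape in shapes:
    print(name, shape)
print("Reshape", reshape_for_pooling(shapes[-1][1]))
```

Under these assumptions the trace ends at (256, 8, 25) after Res4 and (2048, 25) after the reshape, as listed. The one row it does not reproduce is Res1: with a 1×1 stride the sketch gives (32, 64, 200) rather than the listed (32, 64, 100), and (32, 64, 200) is also the input size that Conv2's 2×2 stride would need to produce (64, 32, 100).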