
Table 1 Experimental results on MSR-VTT. We use three data split protocols, denoted A [23], B [18], and C [19]. Columns are grouped into Text-to-Video (T2V) and Video-to-Text (V2T) retrieval; higher R@k and mAP and lower Med R indicate better performance.

From: Level-wise aligned dual networks for text–video retrieval

| Method | Split | T2V R@1 | T2V R@5 | T2V R@10 | T2V Med R | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T Med R | V2T mAP | SumR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mithun et al.\(^*\) [9] | A [23] | 7 | 20.9 | 29.7 | 38 | - | 12.5 | 32.1 | 42.4 | 16 | - | 144.6 |
| TCE\(^*\) [14] | | 7.7 | 22.5 | 32.1 | 30 | - | - | - | - | - | - | - |
| HGR\(^*\) [15] | | 9.2 | 26.2 | 36.5 | 24 | - | 15 | 36.7 | 48.8 | 11 | - | 172.4 |
| CE\(^*\) [13] | | 10 | 29 | 41.2 | 16 | - | 15.6 | 40.9 | 55.2 | 8.3 | - | 191.9 |
| W2VV [42] | | 1.1 | 4.7 | 8.1 | 236 | 3.7 | 17 | 37.9 | 49.1 | 11 | 7.6 | 117.9 |
| MEE [18] | | 6.8 | 20.7 | 31.1 | 28 | 14.7 | 13.4 | 32 | 44 | 14 | 6.6 | 148 |
| CE [13] | | 7.9 | 23.6 | 34.6 | 23 | 16.5 | 11 | 31.9 | 46.1 | 13 | 6.8 | 155.1 |
| VSE++ [52] | | 8.7 | 24.3 | 34.1 | 28 | 16.9 | 15.6 | 36.6 | 48.6 | 11 | 7.4 | 167.9 |
| TCE [14] | | 9.3 | 27.3 | 38.6 | 19 | 18.7 | 15.1 | 36.8 | 50.2 | 10 | 8 | 177.3 |
| W2VV++ [20] | | 11.1 | 29.6 | 40.5 | 18 | 20.6 | 17.5 | 40.2 | 52.5 | 9 | 8.5 | 191.4 |
| HGR [15] | | 11.1 | 30.5 | 42.1 | 16 | 20.8 | 18.7 | 44.3 | 57.6 | 7 | 9.9 | 204.4 |
| Dual Encoding [22] | | 11.6 | 30.3 | 41.3 | 17 | 21.2 | 22.5 | 47.1 | 58.9 | 7 | 10.5 | 211.7 |
| LADN | | 12.9 | 33.6 | 45.3 | 14 | 23.3 | 22.2 | 47.4 | 60.3 | 6 | 11.5 | 221.7 |
| Dual Encoding\(^\star\) [22] | | 12.1 | 31.4 | 42.9 | 16 | 22.0 | 21.3 | 45.6 | 58.1 | 7 | 10.4 | 211.3 |
| LADN\(^\star\) | | 13.1 | 33.9 | 45.4 | 14 | 23.4 | 23.0 | 48.1 | 60.5 | 6 | 11.5 | 224.0 |
| JPoSE\(^*\) [8] | B [18] | 14.3 | 38.1 | 53 | 9 | - | 16.4 | 41.3 | 54.4 | 8.7 | - | 217.5 |
| MEE\(^*\) [18] | | 16.8 | 41 | 54.4 | 9 | - | - | - | - | - | - | - |
| TCE\(^*\) [14] | | 17.1 | 39.9 | 53.7 | 9 | - | - | - | - | - | - | - |
| CE\(^*\) [13] | | 18.2 | 46 | 60.7 | 7 | - | 18 | 46 | 60.3 | 6.5 | - | 249.2 |
| W2VV [42] | | 2.7 | 12.5 | 17.3 | 83 | 7.9 | 17.3 | 42 | 53.5 | 9 | 29.3 | 145.3 |
| MEE [18] | | 15.7 | 39 | 52.3 | 9 | 27.1 | 15.3 | 41.9 | 54.5 | 8 | 28.1 | 218.7 |
| VSE++ [52] | | 17 | 40.9 | 52 | 10 | 16.9 | 18.1 | 40.4 | 52.1 | 9 | 29.2 | 220.5 |
| CE [13] | | 17.8 | 42.8 | 56.1 | 8 | 30.3 | 17.4 | 42.9 | 56.1 | 8 | 29.8 | 233.1 |
| TCE [14] | | 17 | 44.7 | 58.3 | 7 | 30 | 15.1 | 43.3 | 58.2 | 7 | 28.3 | 236.6 |
| W2VV++ [20] | | 21.7 | 48.6 | 60.9 | 6 | 34.4 | 18.6 | 46.4 | 59.1 | 6 | 31.7 | 255.3 |
| HGR [15] | | 22.9 | 50.2 | 63.6 | 5 | 35.9 | 20 | 48.3 | 60.9 | 6 | 33.2 | 265.9 |
| Dual Encoding [22] | | 23 | 50.6 | 62.5 | 5 | 36.1 | 25.1 | 52.1 | 64.6 | 5 | 37.7 | 277.9 |
| LADN | | 25.5 | 52.9 | 66.9 | 5 | 38.6 | 25.3 | 55.2 | 66.7 | 4 | 39.3 | 292.5 |
| Dual Encoding\(^\star\) [22] | | 23.1 | 51.2 | 62.6 | 5 | 35.9 | 24.1 | 52.2 | 63.6 | 5 | 37.18 | 276.8 |
| LADN\(^\star\) | | 26.6 | 55.5 | 66.9 | 4 | 39.9 | 26.9 | 55.0 | 67.4 | 4 | 40.1 | 298.3 |
| JSFusion\(^*\) [19] | C [19] | 10.2 | 31.2 | 43.2 | 13 | - | - | - | - | - | - | - |
| TCE\(^*\) [14] | | 16.1 | 38 | 51.5 | 10 | - | - | - | - | - | - | - |
| Miech et al.\(^*\) [7] | | 14.9 | 40.2 | 52.8 | 9 | - | - | - | - | - | - | - |
| CE\(^*\) [13] | | 20.9 | 48.8 | 62.4 | 6 | - | 20.6 | 50.3 | 64 | 5.3 | - | 267 |
| W2VV [42] | | 1.9 | 9.9 | 15.2 | 79 | 6.8 | 17.3 | 39.3 | 50.2 | 10 | 27.8 | 133.8 |
| VSE++ [52] | | 16 | 38.5 | 50.9 | 10 | 27.4 | 16.2 | 39.3 | 51.2 | 10 | 27.4 | 212.1 |
| MEE [18] | | 14.6 | 38.4 | 52.4 | 9 | 26.1 | 15.2 | 40.9 | 53.8 | 9 | 27.9 | 215.3 |
| W2VV++ [20] | | 19 | 45 | 58.7 | 7 | 31.8 | 16.9 | 42.7 | 54.6 | 8 | 29 | 236.9 |
| CE [13] | | 17.2 | 46.2 | 58.5 | 7 | 30.3 | 15.8 | 44.9 | 59.2 | 7 | 30.4 | 241.8 |
| TCE [14] | | 17.8 | 46 | 58.3 | 7 | 31.1 | 18.9 | 43.5 | 58.8 | 7 | 31.4 | 243.3 |
| HGR [15] | | 21.7 | 47.4 | 61.1 | 6 | 34 | 20.4 | 47.9 | 60.6 | 6 | 33.4 | 259.1 |
| Dual Encoding [22] | | 21.1 | 48.7 | 60.2 | 6 | 33.6 | 21.7 | 49.4 | 61.6 | 6 | 34.7 | 262.7 |
| LADN | | 24.4 | 52 | 63.4 | 5 | 37.4 | 23.6 | 50.8 | 62.8 | 5 | 36.6 | 277.0 |
| Dual Encoding\(^\star\) [22] | | 21.9 | 48.1 | 61.5 | 6 | 34.5 | 22.3 | 48 | 61.6 | 6 | 34.6 | 263.4 |
| LADN\(^\star\) | | 24.6 | 52.5 | 64.0 | 5 | 37.5 | 22.5 | 53.0 | 65.1 | 5 | 36.3 | 281.7 |

  1. \(^*\) denotes results cited directly from the original papers, \(^\star\) denotes results obtained by training with the average ResNeXt-ResNet representation, and all other results are obtained by training with the concatenated ResNeXt-ResNet representation.
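
For reference, the rank-based metrics reported above can be computed from a query-by-item similarity matrix; SumR is the sum of the six R@k values (this is consistent with the rows above, e.g. LADN on split A: 12.9 + 33.6 + 45.3 + 22.2 + 47.4 + 60.3 = 221.7). The snippet below is a minimal sketch, not the paper's evaluation code: the `retrieval_metrics` helper and toy data are illustrative, and it assumes a single ground-truth item per query, whereas the exact MSR-VTT protocol (e.g. multiple captions per video) may differ.

```python
import numpy as np

def retrieval_metrics(sim, gt):
    """Rank-based retrieval metrics from a (num_queries, num_items) similarity matrix.

    gt[i] is the index of the single relevant item for query i (assumption:
    one ground-truth item per query; the exact MSR-VTT protocol may differ).
    """
    order = np.argsort(-sim, axis=1)  # item indices sorted from most to least similar
    # 1-based rank at which the relevant item is retrieved for each query
    ranks = np.array([int(np.where(order[i] == gt[i])[0][0]) + 1
                      for i in range(sim.shape[0])])
    recall = {k: float(np.mean(ranks <= k) * 100) for k in (1, 5, 10)}  # R@1/5/10 in %
    med_r = float(np.median(ranks))                                     # Med R
    mean_ap = float(np.mean(1.0 / ranks) * 100)  # mAP; AP reduces to 1/rank with one relevant item
    return recall, med_r, mean_ap

# Toy usage: 4 captions vs. 4 videos, ground truth on the diagonal.
sim = np.random.rand(4, 4)
gt = np.arange(4)
t2v_recall, t2v_medr, t2v_map = retrieval_metrics(sim, gt)    # text-to-video
v2t_recall, v2t_medr, v2t_map = retrieval_metrics(sim.T, gt)  # video-to-text
sum_r = sum(t2v_recall.values()) + sum(v2t_recall.values())   # SumR: sum of the six R@k values
print(t2v_recall, t2v_medr, t2v_map, sum_r)
```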