Level-wise aligned dual networks for text–video retrieval

EURASIP Journal on Advances in Signal Processing

Table 5 Ablation experiments on the MSR-VTT B partition [18]. "w/" and "w/o" mean with and without, respectively; g, t, l, and s_t denote global, temporal, local, and spatial–temporal, respectively

LADN variants	Text-to-Video R@1	Text-to-Video R@5	Text-to-Video R@10	Text-to-Video Med R	Text-to-Video mAP	Video-to-Text R@1	Video-to-Text R@5	Video-to-Text R@10	Video-to-Text Med R	Video-to-Text mAP	SumR
original LADN	26.6	55.5	66.9	4	39.9	26.9	55.0	67.4	4	40.1	298.3
w/o g, t, l alignments	24.2	52.6	61.2	5	37.0	25.3	52.3	62.9	5	38.1	278.5
w/o semantic space	25.3	53.8	64.3	5	38.2	27.0	52.8	64.3	5	39.5	287.5
w/ g, t, l semantic spaces	26.2	54.2	66.6	5	39.0	26.4	54.5	66.4	4	39.8	294.3
w/o g alignment	25.1	55.6	66.3	4	38.9	25.3	54.8	67.0	4	39.4	294.1
w/o l alignment	24.9	53.1	64.6	5	38.1	25.5	54.7	65.8	5	39.1	288.6
w/o t alignment	25.6	56.2	66.4	4	39.2	26.4	54.8	66.6	4	39.9	296.0
w/o s_t alignment	25.9	53.3	66.2	5	38.9	26.0	55.1	66.2	4	39.7	292.7
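
The SumR column is not defined in this excerpt. The sketch below assumes it is the sum of the six recall values (R@1, R@5, R@10 in both retrieval directions); every row of the table is consistent with that reading. The names `ROWS`, `sum_r`, and `drop_from_full` are hypothetical helpers introduced only for illustration, and the numbers are copied from Table 5.

```python
# Each entry: (text-to-video R@1, R@5, R@10), (video-to-text R@1, R@5, R@10), reported SumR.
# Values are copied from Table 5; nothing here is newly measured.
ROWS = {
    "original LADN":              ((26.6, 55.5, 66.9), (26.9, 55.0, 67.4), 298.3),
    "w/o g, t, l alignments":     ((24.2, 52.6, 61.2), (25.3, 52.3, 62.9), 278.5),
    "w/o semantic space":         ((25.3, 53.8, 64.3), (27.0, 52.8, 64.3), 287.5),
    "w/ g, t, l semantic spaces": ((26.2, 54.2, 66.6), (26.4, 54.5, 66.4), 294.3),
    "w/o g alignment":            ((25.1, 55.6, 66.3), (25.3, 54.8, 67.0), 294.1),
    "w/o l alignment":            ((24.9, 53.1, 64.6), (25.5, 54.7, 65.8), 288.6),
    "w/o t alignment":            ((25.6, 56.2, 66.4), (26.4, 54.8, 66.6), 296.0),
    "w/o s_t alignment":          ((25.9, 53.3, 66.2), (26.0, 55.1, 66.2), 292.7),
}


def sum_r(t2v, v2t):
    """Assumed definition: sum of R@1, R@5, R@10 over both retrieval directions."""
    return round(sum(t2v) + sum(v2t), 1)


def drop_from_full(rows, baseline="original LADN"):
    """Reported SumR of the baseline minus the reported SumR of each other variant."""
    base = rows[baseline][2]
    return {
        name: round(base - reported, 1)
        for name, (_, _, reported) in rows.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (t2v, v2t, reported) in ROWS.items():
        print(f"{name}: computed {sum_r(t2v, v2t)}, reported {reported}")
    for name, drop in sorted(drop_from_full(ROWS).items(), key=lambda kv: -kv[1]):
        print(f"{name}: SumR drop {drop}")
```

Sorting the variants by their SumR drop gives a quick view of which removed component costs the most; under the assumed definition, removing the g, t, and l alignments together produces the largest drop in this table.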