Table 5. Ablation experiments on the MSR-VTT B partition [18]. w/ and w/o mean with and without, respectively; g, t, l, and s_t denote global, temporal, local, and spatial–temporal, respectively

From: Level-wise aligned dual networks for text–video retrieval

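T2V: Text-to-Video retrieval; V2T: Video-to-Text retrieval.

| LADN variants | T2V R@1 | T2V R@5 | T2V R@10 | T2V Med R | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T Med R | V2T mAP | SumR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| original LADN | 26.6 | 55.5 | 66.9 | 4 | 39.9 | 26.9 | 55.0 | 67.4 | 4 | 40.1 | 298.3 |
| w/o g, t, l alignments | 24.2 | 52.6 | 61.2 | 5 | 37.0 | 25.3 | 52.3 | 62.9 | 5 | 38.1 | 278.5 |
| w/o semantic space | 25.3 | 53.8 | 64.3 | 5 | 38.2 | 27.0 | 52.8 | 64.3 | 5 | 39.5 | 287.5 |
| w/ g, t, l semantic spaces | 26.2 | 54.2 | 66.6 | 5 | 39.0 | 26.4 | 54.5 | 66.4 | 4 | 39.8 | 294.3 |
| w/o g alignment | 25.1 | 55.6 | 66.3 | 4 | 38.9 | 25.3 | 54.8 | 67.0 | 4 | 39.4 | 294.1 |
| w/o l alignment | 24.9 | 53.1 | 64.6 | 5 | 38.1 | 25.5 | 54.7 | 65.8 | 5 | 39.1 | 288.6 |
| w/o t alignment | 25.6 | 56.2 | 66.4 | 4 | 39.2 | 26.4 | 54.8 | 66.6 | 4 | 39.9 | 296.0 |
| w/o s_t alignment | 25.9 | 53.3 | 66.2 | 5 | 38.9 | 26.0 | 55.1 | 66.2 | 4 | 39.7 | 292.7 |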
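
The table does not define SumR. The reported values are consistent with the sum of the six recall figures (R@1, R@5, R@10 in both retrieval directions). The sketch below checks that reading against two rows. The function name `sum_r` and the dictionary layout are illustrative assumptions, not taken from the paper.

```python
# Recall values copied from two rows of Table 5.
# Each tuple is (R@1, R@5, R@10) for one retrieval direction.
rows = {
    "original LADN": {
        "t2v": (26.6, 55.5, 66.9),
        "v2t": (26.9, 55.0, 67.4),
        "reported_sum_r": 298.3,
    },
    "w/o g, t, l alignments": {
        "t2v": (24.2, 52.6, 61.2),
        "v2t": (25.3, 52.3, 62.9),
        "reported_sum_r": 278.5,
    },
}


def sum_r(t2v, v2t):
    """Sum the six recall values (assumed definition of SumR), to one decimal."""
    return round(sum(t2v) + sum(v2t), 1)


for name, row in rows.items():
    computed = sum_r(row["t2v"], row["v2t"])
    matches = computed == row["reported_sum_r"]
    print(f"{name}: computed={computed}, reported={row['reported_sum_r']}, match={matches}")
```

Under this reading, Med R and mAP do not enter SumR, so differences in the SumR column reflect recall changes only.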