
Table 1 Experimental results on MSR-VTT. We use three data split protocols, denoted A [23], B [18], and C [19]. Columns are grouped into Text-to-Video (T2V) and Video-to-Text (V2T) retrieval; higher R@k and mAP and lower Med R indicate better performance.

From: Level-wise aligned dual networks for text–video retrieval

| Method | Split | T2V R@1 | T2V R@5 | T2V R@10 | T2V Med R | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T Med R | V2T mAP | SumR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mithun et al.\(^*\) [9] | A [23] | 7 | 20.9 | 29.7 | 38 | - | 12.5 | 32.1 | 42.4 | 16 | - | 144.6 |
| TCE\(^*\) [14] | | 7.7 | 22.5 | 32.1 | 30 | - | - | - | - | - | - | - |
| HGR\(^*\) [15] | | 9.2 | 26.2 | 36.5 | 24 | - | 15 | 36.7 | 48.8 | 11 | - | 172.4 |
| CE\(^*\) [13] | | 10 | 29 | 41.2 | 16 | - | 15.6 | 40.9 | 55.2 | 8.3 | - | 191.9 |
| W2VV [42] | | 1.1 | 4.7 | 8.1 | 236 | 3.7 | 17 | 37.9 | 49.1 | 11 | 7.6 | 117.9 |
| MEE [18] | | 6.8 | 20.7 | 31.1 | 28 | 14.7 | 13.4 | 32 | 44 | 14 | 6.6 | 148 |
| CE [13] | | 7.9 | 23.6 | 34.6 | 23 | 16.5 | 11 | 31.9 | 46.1 | 13 | 6.8 | 155.1 |
| VSE++ [52] | | 8.7 | 24.3 | 34.1 | 28 | 16.9 | 15.6 | 36.6 | 48.6 | 11 | 7.4 | 167.9 |
| TCE [14] | | 9.3 | 27.3 | 38.6 | 19 | 18.7 | 15.1 | 36.8 | 50.2 | 10 | 8 | 177.3 |
| W2VV++ [20] | | 11.1 | 29.6 | 40.5 | 18 | 20.6 | 17.5 | 40.2 | 52.5 | 9 | 8.5 | 191.4 |
| HGR [15] | | 11.1 | 30.5 | 42.1 | 16 | 20.8 | 18.7 | 44.3 | 57.6 | 7 | 9.9 | 204.4 |
| Dual Encoding [22] | | 11.6 | 30.3 | 41.3 | 17 | 21.2 | 22.5 | 47.1 | 58.9 | 7 | 10.5 | 211.7 |
| LADN | | 12.9 | 33.6 | 45.3 | 14 | 23.3 | 22.2 | 47.4 | 60.3 | 6 | 11.5 | 221.7 |
| Dual Encoding\(^\star\) [22] | | 12.1 | 31.4 | 42.9 | 16 | 22.0 | 21.3 | 45.6 | 58.1 | 7 | 10.4 | 211.3 |
| LADN\(^\star\) | | 13.1 | 33.9 | 45.4 | 14 | 23.4 | 23.0 | 48.1 | 60.5 | 6 | 11.5 | 224.0 |
| JPoSE\(^*\) [8] | B [18] | 14.3 | 38.1 | 53 | 9 | - | 16.4 | 41.3 | 54.4 | 8.7 | - | 217.5 |
| MEE\(^*\) [18] | | 16.8 | 41 | 54.4 | 9 | - | - | - | - | - | - | - |
| TCE\(^*\) [14] | | 17.1 | 39.9 | 53.7 | 9 | - | - | - | - | - | - | - |
| CE\(^*\) [13] | | 18.2 | 46 | 60.7 | 7 | - | 18 | 46 | 60.3 | 6.5 | - | 249.2 |
| W2VV [42] | | 2.7 | 12.5 | 17.3 | 83 | 7.9 | 17.3 | 42 | 53.5 | 9 | 29.3 | 145.3 |
| MEE [18] | | 15.7 | 39 | 52.3 | 9 | 27.1 | 15.3 | 41.9 | 54.5 | 8 | 28.1 | 218.7 |
| VSE++ [52] | | 17 | 40.9 | 52 | 10 | 16.9 | 18.1 | 40.4 | 52.1 | 9 | 29.2 | 220.5 |
| CE [13] | | 17.8 | 42.8 | 56.1 | 8 | 30.3 | 17.4 | 42.9 | 56.1 | 8 | 29.8 | 233.1 |
| TCE [14] | | 17 | 44.7 | 58.3 | 7 | 30 | 15.1 | 43.3 | 58.2 | 7 | 28.3 | 236.6 |
| W2VV++ [20] | | 21.7 | 48.6 | 60.9 | 6 | 34.4 | 18.6 | 46.4 | 59.1 | 6 | 31.7 | 255.3 |
| HGR [15] | | 22.9 | 50.2 | 63.6 | 5 | 35.9 | 20 | 48.3 | 60.9 | 6 | 33.2 | 265.9 |
| Dual Encoding [22] | | 23 | 50.6 | 62.5 | 5 | 36.1 | 25.1 | 52.1 | 64.6 | 5 | 37.7 | 277.9 |
| LADN | | 25.5 | 52.9 | 66.9 | 5 | 38.6 | 25.3 | 55.2 | 66.7 | 4 | 39.3 | 292.5 |
| Dual Encoding\(^\star\) [22] | | 23.1 | 51.2 | 62.6 | 5 | 35.9 | 24.1 | 52.2 | 63.6 | 5 | 37.18 | 276.8 |
| LADN\(^\star\) | | 26.6 | 55.5 | 66.9 | 4 | 39.9 | 26.9 | 55.0 | 67.4 | 4 | 40.1 | 298.3 |
| JSFusion\(^*\) [19] | C [19] | 10.2 | 31.2 | 43.2 | 13 | - | - | - | - | - | - | - |
| TCE\(^*\) [14] | | 16.1 | 38 | 51.5 | 10 | - | - | - | - | - | - | - |
| Miech et al.\(^*\) [7] | | 14.9 | 40.2 | 52.8 | 9 | - | - | - | - | - | - | - |
| CE\(^*\) [13] | | 20.9 | 48.8 | 62.4 | 6 | - | 20.6 | 50.3 | 64 | 5.3 | - | 267 |
| W2VV [42] | | 1.9 | 9.9 | 15.2 | 79 | 6.8 | 17.3 | 39.3 | 50.2 | 10 | 27.8 | 133.8 |
| VSE++ [52] | | 16 | 38.5 | 50.9 | 10 | 27.4 | 16.2 | 39.3 | 51.2 | 10 | 27.4 | 212.1 |
| MEE [18] | | 14.6 | 38.4 | 52.4 | 9 | 26.1 | 15.2 | 40.9 | 53.8 | 9 | 27.9 | 215.3 |
| W2VV++ [20] | | 19 | 45 | 58.7 | 7 | 31.8 | 16.9 | 42.7 | 54.6 | 8 | 29 | 236.9 |
| CE [13] | | 17.2 | 46.2 | 58.5 | 7 | 30.3 | 15.8 | 44.9 | 59.2 | 7 | 30.4 | 241.8 |
| TCE [14] | | 17.8 | 46 | 58.3 | 7 | 31.1 | 18.9 | 43.5 | 58.8 | 7 | 31.4 | 243.3 |
| HGR [15] | | 21.7 | 47.4 | 61.1 | 6 | 34 | 20.4 | 47.9 | 60.6 | 6 | 33.4 | 259.1 |
| Dual Encoding [22] | | 21.1 | 48.7 | 60.2 | 6 | 33.6 | 21.7 | 49.4 | 61.6 | 6 | 34.7 | 262.7 |
| LADN | | 24.4 | 52 | 63.4 | 5 | 37.4 | 23.6 | 50.8 | 62.8 | 5 | 36.6 | 277.0 |
| Dual Encoding\(^\star\) [22] | | 21.9 | 48.1 | 61.5 | 6 | 34.5 | 22.3 | 48 | 61.6 | 6 | 34.6 | 263.4 |
| LADN\(^\star\) | | 24.6 | 52.5 | 64.0 | 5 | 37.5 | 22.5 | 53.0 | 65.1 | 5 | 36.3 | 281.7 |

  1. \(^*\) denotes results cited directly from the original papers, \(^\star\) denotes results obtained by training with the average ResNeXt-ResNet representation, and all other results are obtained by training with the concatenated ResNeXt-ResNet representation.
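
For reference, the rank-based metrics reported above can be computed from a query-by-item similarity matrix; SumR is the sum of the six R@k values (this is consistent with the rows above, e.g. LADN on split A: 12.9 + 33.6 + 45.3 + 22.2 + 47.4 + 60.3 = 221.7). The snippet below is a minimal sketch, not the paper's evaluation code: the `retrieval_metrics` helper and toy data are illustrative, and it assumes a single ground-truth item per query, whereas the exact MSR-VTT protocol (e.g. multiple captions per video) may differ.

```python
import numpy as np

def retrieval_metrics(sim, gt):
    """Rank-based retrieval metrics from a (num_queries, num_items) similarity matrix.

    gt[i] is the index of the single relevant item for query i (assumption:
    one ground-truth item per query; the exact MSR-VTT protocol may differ).
    """
    order = np.argsort(-sim, axis=1)  # item indices sorted from most to least similar
    # 1-based rank at which the relevant item is retrieved for each query
    ranks = np.array([int(np.where(order[i] == gt[i])[0][0]) + 1
                      for i in range(sim.shape[0])])
    recall = {k: float(np.mean(ranks <= k) * 100) for k in (1, 5, 10)}  # R@1/5/10 in %
    med_r = float(np.median(ranks))                                     # Med R
    mean_ap = float(np.mean(1.0 / ranks) * 100)  # mAP; AP reduces to 1/rank with one relevant item
    return recall, med_r, mean_ap

# Toy usage: 4 captions vs. 4 videos, ground truth on the diagonal.
sim = np.random.rand(4, 4)
gt = np.arange(4)
t2v_recall, t2v_medr, t2v_map = retrieval_metrics(sim, gt)    # text-to-video
v2t_recall, v2t_medr, v2t_map = retrieval_metrics(sim.T, gt)  # video-to-text
sum_r = sum(t2v_recall.values()) + sum(v2t_recall.values())   # SumR: sum of the six R@k values
print(t2v_recall, t2v_medr, t2v_map, sum_r)
```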