From: Level-wise aligned dual networks for text–video retrieval
LADN variants | Text-to-Video R@1 | Text-to-Video R@5 | Text-to-Video R@10 | Text-to-Video Med R | Text-to-Video mAP | Video-to-Text R@1 | Video-to-Text R@5 | Video-to-Text R@10 | Video-to-Text Med R | Video-to-Text mAP | SumR |
---|---|---|---|---|---|---|---|---|---|---|---|
original LADN | 26.6 | 55.5 | 66.9 | 4 | 39.9 | 26.9 | 55.0 | 67.4 | 4 | 40.1 | 298.3 |
w/o g, t, l alignments | 24.2 | 52.6 | 61.2 | 5 | 37.0 | 25.3 | 52.3 | 62.9 | 5 | 38.1 | 278.5 |
w/o semantic space | 25.3 | 53.8 | 64.3 | 5 | 38.2 | 27.0 | 52.8 | 64.3 | 5 | 39.5 | 287.5 |
w/ g, t, l semantic spaces | 26.2 | 54.2 | 66.6 | 5 | 39.0 | 26.4 | 54.5 | 66.4 | 4 | 39.8 | 294.3 |
w/o g alignment | 25.1 | 55.6 | 66.3 | 4 | 38.9 | 25.3 | 54.8 | 67.0 | 4 | 39.4 | 294.1 |
w/o l alignment | 24.9 | 53.1 | 64.6 | 5 | 38.1 | 25.5 | 54.7 | 65.8 | 5 | 39.1 | 288.6 |
w/o t alignment | 25.6 | 56.2 | 66.4 | 4 | 39.2 | 26.4 | 54.8 | 66.6 | 4 | 39.9 | 296.0 |
w/o s_t alignment | 25.9 | 53.3 | 66.2 | 5 | 38.9 | 26.0 | 55.1 | 66.2 | 4 | 39.7 | 292.7 |
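
The SumR column appears to be the sum of the six recall values in each row (R@1, R@5, and R@10 for both retrieval directions). That definition is an assumption inferred from the numbers, since the table itself does not state it. The minimal sketch below recomputes the sum for each row from the values copied out of the table and compares it with the reported SumR; the names `rows`, `sum_r`, and `mismatches` are illustrative only.

```python
# Each entry: (T2V R@1, T2V R@5, T2V R@10, V2T R@1, V2T R@5, V2T R@10, reported SumR),
# copied from the table above.
rows = {
    "original LADN":              (26.6, 55.5, 66.9, 26.9, 55.0, 67.4, 298.3),
    "w/o g, t, l alignments":     (24.2, 52.6, 61.2, 25.3, 52.3, 62.9, 278.5),
    "w/o semantic space":         (25.3, 53.8, 64.3, 27.0, 52.8, 64.3, 287.5),
    "w/ g, t, l semantic spaces": (26.2, 54.2, 66.6, 26.4, 54.5, 66.4, 294.3),
    "w/o g alignment":            (25.1, 55.6, 66.3, 25.3, 54.8, 67.0, 294.1),
    "w/o l alignment":            (24.9, 53.1, 64.6, 25.5, 54.7, 65.8, 288.6),
    "w/o t alignment":            (25.6, 56.2, 66.4, 26.4, 54.8, 66.6, 296.0),
    "w/o s_t alignment":          (25.9, 53.3, 66.2, 26.0, 55.1, 66.2, 292.7),
}


def sum_r(recalls):
    """Assumed definition of SumR: the plain sum of the six recall values."""
    return sum(recalls)


# Collect rows whose recomputed sum differs from the reported SumR by more
# than half of the last printed digit (tolerance for floating-point error).
mismatches = [
    name
    for name, (*recalls, reported) in rows.items()
    if abs(sum_r(recalls) - reported) >= 0.05
]

print("rows inconsistent with the assumed SumR definition:", mismatches)
```

Under this assumed definition, every row of the table can be checked the same way; Med R and mAP do not enter the sum.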