Fig. 2From: Level-wise aligned dual networks for text–video retrievalThe framework of our proposed LADN method. LADN first utilizes multi-level encoders to extract global, temporal, local, and spatial–temporal information in videos and text. Then, they are mapped into four different latent spaces and a semantic space. Finally, LADN aligns different levels of information in various spacesBack to article page