Level-wise aligned dual networks for text–video retrieval

The vast amount of videos on the Internet makes efficient and accurate text–video retrieval increasingly important. Current methods align video and text in a single high-dimensional common space for this task. However, a single high-dimensional space cannot fully exploit the different levels of information in videos and text. In this paper, we put forward a method called level-wise aligned dual networks (LADNs) for text–video retrieval. LADN uses four common latent spaces to improve the performance of text–video retrieval and utilizes a semantic concept space to increase the interpretability of the model. Specifically, LADN first extracts different levels of information, including global, local, temporal, and spatial–temporal information, from videos and text. Then, they are mapped into four different latent spaces and one semantic space. Finally, LADN aligns the different levels of information in the various spaces. Extensive experiments conducted on three widely used datasets, including MSR-VTT, VATEX, and TRECVID AVS 2016-2018, demonstrate that our proposed approach is superior to several state-of-the-art text–video retrieval approaches.

Because semantic concepts are limited and unstructured, semantic-based methods cannot accurately search for fine-grained content or exploit temporal information. For example, "a dog chases a cat" and "a cat chases a dog" share the same semantic concepts, although the order of objects in the caption is potentially significant. In addition, a semantic-based video retrieval method can hardly return satisfactory results for a query such as "a black dog chases a white cat". Although semantic-based methods offer a certain degree of interpretability, how to specify a set of relevant and detectable semantic concepts for video and text features remains unsolved.
To address the limitations of semantic-based methods, researchers have paid more attention to utilizing original sentences, which contain richer contextual information than semantic concepts. At present, the main methods for text-video cross-modal retrieval map video and text into a common latent space, where the cross-modal similarity can be measured.
For text representations, bag of words remains popular [16,17], while deep networks are increasingly used. For each word of a sentence, a dense vector is first generated by multiplying its one-hot vector with a pre-trained word embedding matrix. Then, the word vectors are combined into a sentence-level representation by NetVLAD [8,13], max pooling [7], Fisher Vector [18], RNN [9,19], or graph convolutional network [15].
W2VV++ [20] leverages three text representations, including bag of words, word2vec, and gated recurrent unit (GRU), to form a high-dimensional sentence-level representation. Nevertheless, W2VV++ only utilizes a mean pooling strategy over video frames. Dong et al. [21,22] utilize a multi-level encoding strategy to extract multiple video representations and combine them into a final video representation. Although common space learning methods give superior performance to semantic-based methods, each dimension of the common latent space lacks interpretability. Combining the advantages of the common latent space and the semantic concept space can improve the cross-modal retrieval performance and increase the interpretability of the model. Furthermore, unlike Dual Encoding [22], we measure multiple similarities in different latent spaces in addition to taking advantage of the common latent space and the semantic concept space.
In this paper, we make the following contributions.
• We design a level-wise aligned mechanism to align representations between videos and sentences at different levels. Specifically, we first exploit multi-level encoders to extract global, local, temporal, and spatial-temporal information from videos and text, respectively. Then, they are mapped into four different latent spaces and one semantic space.
• We combine the advantages of the common latent space and the semantic concept space to improve the cross-modal retrieval performance and increase the interpretability of our model. Specifically, we average the four cross-modal similarities of different levels in the four latent spaces and then combine the result with the similarity in the semantic concept space.
• Extensive experiments are conducted on three widely used datasets, including MSR-VTT [23], VATEX [24], and TRECVID AVS 2016-2018 [25][26][27]. The experimental results show that our approach gives superior performance to state-of-the-art approaches.
The rest of this paper is organized as follows. Some related work is introduced in Sect. 1.2. We present our proposed method and experimental setup in Sects. 2 and 3, respectively. Section 3.4 provides the experimental results. Finally, Sect. 4 concludes our work.

Related work
This section reviews previous work on language and video representation learning and on video-text retrieval, including semantic space learning and latent space learning.

Language representations
Bag of words [28] and word2vec [29] are early approaches to text representation, which cannot capture the contextual information in a sentence. The long short-term memory network (LSTM) [30] is one of the first deep models to overcome this shortcoming. Recently, the transformer architecture [31] has given impressive performance in sentence representation by leveraging a self-attention mechanism, where every word in a sentence can attend to all other words. The transformer architecture consists of alternately stacked self-attention layers and fully connected layers, which form the basis of the popular language architecture BERT [32]. Burns et al. [33] leverage different word embeddings and language networks (LSTM, BERT, etc.) and analyze their performance in text-video tasks. They find that a pre-trained and frozen BERT architecture performs worse than an average-embedding architecture or the LSTM.

Video representations
A common method to extract video representations is to extract each keyframe representation with a pre-trained CNN model and then combine them by max pooling or mean pooling. However, these approaches cannot capture the temporal information in the video. In order to incorporate spatial and temporal representations from still frames and motion between frames, Simonyan et al. [34] leverage a two-stream model to perform action recognition in videos. Besides, I3D [35] utilizes a two-stream inflated 3D ConvNet to better capture the temporal information in a video. Xie et al. [36] propose an alternative approach, which replaces 3D convolutions with 2D spatial convolutions and 1D temporal convolutions.

Semantic space learning

[37,38] create a concept vector for each test keyframe by concatenating 1000 ImageNet concepts and 345 TRECVID SIN concepts and translate a textual query into relevant predefined concepts by a set of complex linguistic rules. [39] builds a much larger semantic concept bank containing over 50,000 concepts by utilizing a pre-trained CNN architecture and support vector machines (SVMs). [40] recognizes ImageNet hierarchies to obtain about 13k concepts and utilizes VideoStory [16] to generate semantic representations. Since semantic concepts are limited and unstructured, it is hard to represent the rich contextual information within both sentences and videos. However, encoding videos and sentences into concept vectors makes the model somewhat interpretable.

Latent space learning
Methods based on a common latent space first extract representations from video and sentence, respectively, and then project them into a latent space, where the cross-modal similarity can be directly calculated. For these methods, what matters is how to extract rich representations from video and sentence separately and how to measure the video-text similarity. Therefore, we review recent progress from these three aspects: video representations, text representations, and similarity learning. For video representations, a common method is to first extract frame representations from the video through pre-trained CNN models and then combine them along the temporal dimension into a video-level representation by mean pooling [9,10,13,41,42] or max pooling [7,15,18].
Yang et al. [14] first leverage GRU to explore the temporal relationship between video keyframes and then use a self-attention mechanism to capture the representation interaction among keyframes. Additionally, [7,9,13] leverage motion features extracted from the I3D model [35], and audio features generated by the audio CNN model [43] as part of the visual representations. Nevertheless, these methods still leverage max pooling, mean pooling, or NetVLAD to combine various features into a single feature vector per video.
For text representations, word2vec models pre-trained on large-scale text corpora are widely used. Specifically, for each word of a sentence, a dense vector is first generated by multiplying its one-hot vector with a pre-trained word embedding matrix. Then, the word vectors are combined by NetVLAD, max pooling, or Fisher Vector. Although these approaches achieve good performance, they cannot capture the sequential information in a sentence. Recurrent neural networks (RNNs) are effective in exploiting sequential information, and variants of the RNN, such as LSTM, bidirectional LSTM, GRU, and bidirectional GRU, are utilized in [9,19,44,45], respectively. For example, in [9], the sentence representation is taken from the last hidden state of the GRU. [42] and [20] utilize three text representations, including BoW, word2vec, and GRU; however, these methods only leverage mean pooling to obtain the video representation. HGR [15] utilizes a hierarchical decomposition of a sentence to explore the relationship between words, which requires the sentence to be well annotated with certain linguistic rules.

For video-text similarity learning, recent methods map video features and text features into a common latent space, where the text-video cross-modal similarity can be computed by cosine similarity, and leverage various triplet ranking losses to train their models. In addition to the triplet ranking loss, a reconstruction loss and a contrastive loss are utilized to learn the latent space in [46]. Recently, an increasing number of methods learn several latent spaces instead of just one. Mixture of embedding experts (MEE) [18] computes the final similarity by a weighted combination between the sentence and multiple video latent spaces, one for each input, including motion, appearance, face, or audio representations. HGR [15] assumes a hierarchical decomposition of the video and text and projects them into three spaces, including events, actions, and entities.
Unlike the existing methods that learn either a semantic concept space or a common latent space, our approach simultaneously learns both, which takes advantage of the interpretability of the semantic concept space and the high performance of the common latent space. We separately represent video and text as four complementary representations, including global, temporal, local, and spatial-temporal representations, and learn one common latent space for each representation. Besides, we also map the spatial-temporal representation into a semantic space. Thus, our proposed method can align different levels of information in various spaces.

Methods
As illustrated in Fig. 2, we put forward an architecture, named level-wise aligned dual networks (LADNs), to improve the performance of text-video cross-modal retrieval.

Video encoder
Following the setting in [22], we extract $n$ frames at 0.5-second intervals from a video. For each frame, we utilize a pre-trained ImageNet CNN to extract a deep representation. Therefore, the video is represented as a sequence of frame representations $\{v_1, v_2, \ldots, v_n\}$, where $v_t$ denotes the representation of the $t$-th frame.
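As a concrete illustration, the following is a minimal sketch (not the authors' released code) of extracting one deep feature per sampled frame with a pre-trained torchvision ResNet-152; the paper itself combines ResNeXt-101 and ResNet-152 features, and all names and sizes below are illustrative.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Sketch: one 2048-d feature per frame sampled every 0.5 s (illustrative, not the official code).
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()   # keep the pooled 2048-d feature, drop the classifier
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_frames(frames):
    """frames: list of PIL images sampled at 0.5-second intervals from one video."""
    batch = torch.stack([preprocess(f) for f in frames])  # (n, 3, 224, 224)
    return resnet(batch)                                  # (n, 2048): v_1, ..., v_n
```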

Global-aware encoder
Given a video, we take the average of all frame features as the global feature $f^{(v)}_1$, which captures visual patterns that appear repeatedly throughout the video clip.

Temporal-aware encoder
We utilize a bidirectional gated recurrent unit (BiGRU) [47] to extract temporal information from the video frame features. The BiGRU consists of two GRUs: one encodes the frame features in the forward direction, and the other in the backward direction. At a specific time step $t$, the hidden state of the forward GRU is denoted $\overrightarrow{h}_t$ and that of the backward GRU is denoted $\overleftarrow{h}_t$. We obtain the BiGRU output at each time step by averaging the two directions, $h_t = \frac{1}{2}(\overrightarrow{h}_t + \overleftarrow{h}_t)$, and pool the outputs $\{h_t\}$ along the temporal dimension to form the temporal-aware feature $f^{(v)}_2$.
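A minimal PyTorch sketch of such a temporal-aware encoder is given below; the mean pooling of the averaged hidden states into $f^{(v)}_2$ and all dimensions are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch of the temporal-aware encoder (dimensions are illustrative)."""
    def __init__(self, feat_dim=4096, hidden_dim=1024):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, frame_feats):            # (batch, n_frames, feat_dim)
        out, _ = self.bigru(frame_feats)       # (batch, n_frames, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)        # forward / backward hidden states
        h = 0.5 * (fwd + bwd)                  # average the two directions
        f2 = h.mean(dim=1)                     # assumed: mean pooling over time -> f_2
        return h, f2                           # h also feeds the local-aware 1D CNN
```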

Local-aware encoder
The temporal-aware feature cannot capture the subtle differences between neighboring frames. Therefore, we apply a 1D CNN [48] on top of the BiGRU to extract local-aware patterns in the video. The output of the BiGRU, $H = [h_1, h_2, \ldots, h_n]$, is the input of the 1D CNN. Let $\mathrm{Conv1d}_{k,r}$ denote a 1D convolutional module with $r$ filters of size $k$; its activation function is ReLU. Next, we apply max pooling over time to obtain a fixed-length vector of dimension $r$. The above process can be expressed as
$$c_k = \operatorname{max\text{-}pooling}\big(\mathrm{ReLU}(\mathrm{Conv1d}_{k,r}(H))\big).$$
We set $k = 2, 3, 4, 5$ to generate multi-scale local-aware representations and concatenate them as $f^{(v)}_3 = [c_2; c_3; c_4; c_5]$.
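The multi-scale convolution described above can be sketched as follows; the number of filters $r$ and the input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    """Sketch of the local-aware encoder: multi-scale 1D CNN over the BiGRU outputs."""
    def __init__(self, in_dim=1024, r=512, kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(in_dim, r, k) for k in kernel_sizes])

    def forward(self, h):                      # h: (batch, n_frames, in_dim)
        x = h.transpose(1, 2)                  # Conv1d expects (batch, channels, length)
        feats = []
        for conv in self.convs:
            c = torch.relu(conv(x))            # (batch, r, n_frames - k + 1)
            feats.append(c.max(dim=2).values)  # max pooling over time -> length r
        return torch.cat(feats, dim=1)         # f_3: concatenation of the four scales
```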

Spatial-temporal encoder
The features $f^{(v)}_1$, $f^{(v)}_2$, and $f^{(v)}_3$ naturally capture global, temporal, and local information in the video content, respectively. We assume the three patterns are complementary to each other, with some redundant information. Therefore, we concatenate the three patterns as $f^{(v)}_4 = [f^{(v)}_1; f^{(v)}_2; f^{(v)}_3]$, which captures the spatial and temporal information in the video.

Text encoder
Similar to the encoding strategy for the video modality, the text modality also utilizes four different encoders to extract different levels of information. Given a sentence $s$ of length $m$, we utilize classical bag-of-words features to represent it. Let $f^{(s)}_1 = [w_1, w_2, \ldots, w_m]$ denote the global feature of the sentence, where $w_i$ denotes the number of occurrences of the $i$-th word.
We leverage a word embedding matrix to convert each word from a one-hot vector into a dense vector. The matrix is initialized by a word2vec model [42] trained on the English tags of 30 million Flickr images. Next, we obtain a temporal-aware representation $f^{(s)}_2$ in the same way as the temporal-aware encoder on the video side. Similar to the video counterpart, we use four 1D CNN modules with $k = 2, 3, 4, 5$ to generate multi-scale representations, whose outputs are concatenated as the local-aware feature $f^{(s)}_3$. Finally, $f^{(s)}_1$, $f^{(s)}_2$, and $f^{(s)}_3$ are concatenated as $f^{(s)}_4$, the spatial-temporal feature of the sentence $s$.
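A toy sketch of the global bag-of-words feature and the word-embedding lookup feeding the other text branches is given below; the vocabulary, dimensions, and randomly initialized embedding are placeholders, whereas the paper initializes the embedding from a Flickr-tag word2vec model.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary; in practice it is built from the training captions.
vocab = {"a": 0, "dog": 1, "chases": 2, "cat": 3}

def bow_feature(sentence, vocab):
    """Global text feature f_1: word-occurrence counts over the vocabulary (sketch)."""
    f1 = torch.zeros(len(vocab))
    for w in sentence.lower().split():
        if w in vocab:
            f1[vocab[w]] += 1.0
    return f1

# Dense word vectors; the paper loads word2vec weights here, random init is a placeholder.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=500)
ids = torch.tensor([vocab[w] for w in "a dog chases a cat".split() if w in vocab])
word_vectors = embedding(ids)   # input to the text BiGRU and 1D CNN branches

print(bow_feature("a dog chases a cat", vocab))   # tensor([2., 1., 1., 1.])
```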

Latent space learning
Given the video and sentence features at the different levels, we transform them into four different latent spaces as follows:
$$\phi^{(x)}_i = \mathrm{BN}\big(W_i f^{(x)}_i + b_i\big), \quad x \in \{v, s\},\ i = 1, 2, 3, 4,$$
where $W_i$ is the weight of a fully connected layer, $b_i$ is its bias term, and $\mathrm{BN}$ denotes a batch normalization layer. Then, we utilize the cosine similarity $\mathrm{sim\_lat}_i(v, s)$ between $\phi^{(v)}_i$ and $\phi^{(s)}_i$ to measure the video-text similarity in the $i$-th latent space. The improved triplet ranking loss is leveraged to make relevant video-text pairs closer than irrelevant pairs during the training phase. We define the bidirectional ranking loss for each level as
$$L_{\mathrm{lat}}^{(i)}(v, s) = \max\!\big(0,\ m_1 + \mathrm{sim\_lat}_i(v, s^-) - \mathrm{sim\_lat}_i(v, s^+)\big) + \max\!\big(0,\ m_2 + \mathrm{sim\_lat}_i(v^-, s) - \mathrm{sim\_lat}_i(v^+, s)\big),$$
where $s^+$ and $s^-$ denote a positive and a negative sentence sample for a video clip $v$, respectively, $v^+$ and $v^-$ denote a positive and a negative video sample for a sentence $s$, respectively, and $m_1$, $m_2$ are margins. In addition, the negative sample is the one most similar to the anchor $v$ or $s$ among all negatives. By taking the average of the ranking losses over the four levels, the final loss in the latent space is
$$L_{\mathrm{lat}}(v, s) = \frac{1}{4}\sum_{i=1}^{4} L_{\mathrm{lat}}^{(i)}(v, s).$$
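The following sketch shows one latent-space projection together with a hardest-negative bidirectional triplet loss of the kind described above; the in-batch negative mining and the dimension choices are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentProjection(nn.Module):
    """Fully connected layer + batch normalization into one latent space (sketch)."""
    def __init__(self, in_dim, out_dim=1536):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x):
        return self.bn(self.fc(x))

def bidirectional_triplet_loss(video_emb, text_emb, m1=0.2, m2=0.2):
    """Improved triplet ranking loss with the hardest in-batch negatives (a sketch).

    The i-th video and the i-th sentence in the batch are assumed to be a positive pair.
    """
    v = F.normalize(video_emb, dim=1)
    s = F.normalize(text_emb, dim=1)
    sim = v @ s.t()                                             # cosine similarity matrix
    pos = sim.diag()
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_s = sim.masked_fill(mask, -1.0).max(dim=1).values  # hardest negative sentence per video
    hardest_v = sim.masked_fill(mask, -1.0).max(dim=0).values  # hardest negative video per sentence
    return (F.relu(m1 + hardest_s - pos) + F.relu(m2 + hardest_v - pos)).mean()
```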

Semantic space learning
Following the setting in [22], during the training phase, we pool all the sentences in the training set and count the occurrences of all semantic concepts. Next, we take the 512 most frequent semantic concepts as the semantic categories. In order to transform $f^{(v)}_4$ and $f^{(s)}_4$ into the semantic space, we use
$$\phi^{(x)}_5 = \sigma\big(\mathrm{BN}(W_5 f^{(x)}_4 + b_5)\big), \quad x \in \{v, s\},$$
where $\sigma(\cdot)$ is the sigmoid activation function, so that $\phi^{(x)}_5$ is a multi-label classification probability vector over the 512 concepts. Given a video-sentence pair and their shared ground-truth concept vector $y$, the binary cross-entropy (BCE) loss is formulated as
$$L_{\mathrm{bce}}(x, y) = -\frac{1}{512}\sum_{j=1}^{512}\Big[y_j \log \phi^{(x)}_{5,j} + (1 - y_j)\log\big(1 - \phi^{(x)}_{5,j}\big)\Big], \quad x \in \{v, s\}.$$
The BCE loss improves the interpretability of the concept space but does not directly improve the performance of video-text retrieval. Therefore, in order to measure the video-sentence similarity in the semantic concept space, we compute a similarity $\mathrm{sim\_sem}(v, s)$ between $\phi^{(v)}_5$ and $\phi^{(s)}_5$. We also leverage the improved triplet ranking loss in the semantic space,
$$L_{\mathrm{trip}}(v, s) = \max\!\big(0,\ m_1 + \mathrm{sim\_sem}(v, s^-) - \mathrm{sim\_sem}(v, s^+)\big) + \max\!\big(0,\ m_2 + \mathrm{sim\_sem}(v^-, s) - \mathrm{sim\_sem}(v^+, s)\big).$$
The final loss in the semantic space, $L_{\mathrm{sem}}(v, s)$, combines the BCE losses of the two modalities with the triplet ranking loss above.
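A compact sketch of this concept branch, a sigmoid classifier over 512 concepts trained with BCE, is shown below; whether the layer is shared across modalities and the input dimension are assumptions here.

```python
import torch
import torch.nn as nn

class ConceptBranch(nn.Module):
    """Sketch: map the spatial-temporal feature f_4 to 512 concept probabilities."""
    def __init__(self, in_dim, num_concepts=512):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_concepts)
        self.bn = nn.BatchNorm1d(num_concepts)

    def forward(self, f4):
        return torch.sigmoid(self.bn(self.fc(f4)))   # multi-label probability vector

bce = nn.BCELoss()
branch = ConceptBranch(in_dim=4096)                   # in_dim is illustrative
f4_video, f4_text = torch.randn(8, 4096), torch.randn(8, 4096)
y = torch.randint(0, 2, (8, 512)).float()             # shared ground-truth concept vector
loss_bce = bce(branch(f4_video), y) + bce(branch(f4_text), y)
```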

Joint training of two spaces
By minimizing the sum of the latent-space loss and the semantic-space loss,
$$L(v, s) = L_{\mathrm{lat}}(v, s) + L_{\mathrm{sem}}(v, s),$$
we train our LADN model end to end. Therefore, our LADN model can leverage different levels of patterns to improve the ranking performance while remaining interpretable.

Measuring of video-text similarity
In the querying phase, we first obtain four similarities at the different levels in the four latent spaces and one similarity in the semantic space. By averaging the four latent-space similarities, we obtain the final latent-based similarity between a video $v$ and a sentence $s$,
$$\mathrm{sim\_lat}(v, s) = \frac{1}{4}\sum_{i=1}^{4}\mathrm{sim\_lat}_i(v, s).$$
Then, min-max normalization is applied to $\mathrm{sim\_lat}(v, s)$ and $\mathrm{sim\_sem}(v, s)$, yielding $\widehat{\mathrm{sim\_lat}}(v, s)$ and $\widehat{\mathrm{sim\_sem}}(v, s)$, respectively. Finally, we combine them in a weighted manner as
$$\mathrm{sim}(v, s) = \gamma\, \widehat{\mathrm{sim\_lat}}(v, s) + (1 - \gamma)\, \widehat{\mathrm{sim\_sem}}(v, s),$$
where $\gamma \in [0, 1]$ is a weight that strikes a balance between the latent space and the semantic space.
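A small sketch of this query-time fusion follows; placing the weight $\gamma$ on the latent-space score is an assumption consistent with the description above.

```python
import torch

def min_max_norm(scores):
    """Normalize a vector of similarity scores to [0, 1] (sketch)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def fuse_similarities(lat_sims, sem_sim, gamma=0.6):
    """lat_sims: list of four (num_videos,) score vectors, one per latent space.
    sem_sim: (num_videos,) scores in the semantic concept space, for one text query."""
    sim_lat = torch.stack(lat_sims).mean(dim=0)        # average the four levels
    return gamma * min_max_norm(sim_lat) + (1 - gamma) * min_max_norm(sem_sim)

# Usage: rank the gallery videos for a query by the fused score.
scores = fuse_similarities([torch.rand(1000) for _ in range(4)], torch.rand(1000))
ranking = scores.argsort(descending=True)
```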

Dataset
The MSR-VTT dataset [23] consists of 10,000 web video clips, each annotated with 20 natural-language sentences. For this dataset, there are three different data partitions. The original partition uses 6513 videos for training, 497 for validation, and 2990 for testing. The second partition [18] uses 6656 videos for training and 1000 for testing. The third partition [19] uses 7010 videos for training and 1000 for testing. For the last two partitions, the 1000 test videos are randomly selected following [22]. We refer to these three partitions as A, B, and C, respectively. The VATEX dataset [24] is a large-scale multilingual dataset for text-video retrieval. Each video is annotated with 10 Chinese and 10 English sentences; in our experiments, only the English sentences are used. Following [15], we utilize 25,991 videos for training, 1500 for validation, and 1500 for testing.
The TRECVID AVS (Ad hoc Video Search) task provides the largest test collection for zero-example video retrieval, the IACC.3 dataset. The IACC.3 dataset, used in the TRECVID AVS 2016-2018 tasks [25][26][27], contains 335,944 shots. Given an ad hoc query, the task is to return a ranked list of 1000 clips according to their likelihood of being relevant to the query. In addition, TRECVID specifies 30 different queries each year.

Performance metrics
For the MSR-VTT and VATEX datasets, R@k (k = 1, 5, 10; higher is better), median rank (Med r; lower is better), and mean average precision (mAP; higher is better) are utilized to evaluate the performance of text-video cross-modal retrieval. R@k is the proportion of queries for which at least one correct item is found in the top-k retrieved results. Med r is the median rank of the first correct item in the retrieved results. We also report the sum of all recalls (SumR) to reflect the overall performance.
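For reference, these rank-based metrics can be computed as in the following sketch, given the 1-based rank of the first correct item for each query (mAP is omitted, as it additionally averages precision over all relevant items).

```python
import numpy as np

def rank_metrics(ranks):
    """ranks: 1-based rank of the first correct item for each query (a sketch)."""
    ranks = np.asarray(ranks)
    r1 = 100.0 * np.mean(ranks <= 1)
    r5 = 100.0 * np.mean(ranks <= 5)
    r10 = 100.0 * np.mean(ranks <= 10)
    return {"R@1": r1, "R@5": r5, "R@10": r10,
            "Med r": float(np.median(ranks)), "SumR": r1 + r5 + r10}

print(rank_metrics([1, 3, 12, 2, 7]))
# {'R@1': 20.0, 'R@5': 60.0, 'R@10': 80.0, 'Med r': 3.0, 'SumR': 160.0}
```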
For the TRECVID AVS tasks on the IACC.3 dataset, we utilize the official performance metric, inferred average precision (infAP, higher is better). For overall performance, we average infAP scores over the queries.

Experimental details
For VATEX, we utilize a 1,024-d I3D [35] representation to represent a video clip. As for the other datasets, we extract ResNeXt-101 [49] and ResNet-152 [50] representations for each frame. We concatenate these two representations to generate a 4,096-d CNN representation, which we call concatenated ResNeXt-ResNet. In addition, we average these two representations to generate a 2,048-d CNN representation, named average ResNeXt-ResNet.
Our proposed model is implemented in PyTorch. Taking MSR-VTT partition B [18] as an example, we set all margins to 0.2, except for $m_2$ in Eq. (3), which is set to 0.3. The dimension of the BiGRU hidden state is set to 1024. The weight $\gamma$ is set to 0.6. The dimensions of the four latent spaces are all set to 1536. We train our model with the Adam optimizer [51], using a batch size of 128 and an initial learning rate of 0.0001. The maximum number of epochs is 50, and an early-stopping mechanism is leveraged to adjust the training process.
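These optimizer settings can be sketched as follows; the placeholder model, the validation score, and the early-stopping patience are illustrative assumptions rather than the authors' exact training script.

```python
import torch

model = torch.nn.Linear(8, 8)          # placeholder for the full LADN model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch_size, max_epochs = 128, 50

best_score, patience, bad_epochs = float("-inf"), 5, 0   # illustrative early stopping
for epoch in range(max_epochs):
    # ... one training epoch over mini-batches of size `batch_size` ...
    val_score = 0.0                     # placeholder: validation SumR would go here
    if val_score > best_score:
        best_score, bad_epochs = val_score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```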

Experiments on MSR-VTT
We utilize the following twelve state-of-the-art methods for comparison.
• MEE [18] computes the final similarity by a weighted combination between the sentence and four video latent spaces, including appearance, motion, face, and audio.
• W2VV [42] leverages three text representations, including BoW, word2vec, and GRU, to represent a sentence.
• VSE++ [52] is a state-of-the-art method that is widely utilized as a baseline for video-text retrieval. We replace its image feature with the feature obtained by mean pooling of frame-level features.
• Mithun et al. [9] learn two latent spaces for videos and text and leverage a weighted triplet ranking loss to train the model.
• W2VV++ [20] is an improved version of W2VV, which takes advantage of better text encoding strategies and an improved triplet ranking loss compared to W2VV.
• CE [13] merges multiple expert features of a video by a collaborative gating mechanism to represent the video.
• TCE [14] leverages a tree-based encoder to represent text and a temporal attentive video encoder to represent videos.
• HGR [15] assumes a hierarchical decomposition of the video and text and projects them into three spaces, including events, actions, and entities.
• JPoSE [8] decomposes captions into nouns and verbs and creates two latent spaces for them, respectively.
• JSFusion [19] utilizes a joint sequence fusion to combine text and video representations.
• Miech et al. [7] leverage gated embedding modules to project videos and text into a common latent space.
• Dual Encoding [22] uses multiple encoding strategies to represent text and video, respectively.
For a fair comparison, we directly cite results from the original papers where available. However, the video representations used in different papers vary. Therefore, we cite results from [22], which are obtained with the same concatenated ResNeXt-ResNet representation as the video representation. In addition, we retrain Dual Encoding [22] with the average ResNeXt-ResNet representation. We train our LADN model with both the concatenated and the average ResNeXt-ResNet representations.

Table 1 presents the retrieval performance on the three partitions of the MSR-VTT dataset. From Table 1, the performance of all methods on the A partition is inferior to that on the B and C partitions, because the A partition uses far more videos for testing. Compared with the SumR of Dual Encoding using the concatenated ResNeXt-ResNet representation, our LADN using the average ResNeXt-ResNet representation improves SumR by 5.81%, 7.34%, and 7.23% on the MSR-VTT A, B, and C partitions, respectively. Dual Encoding only maps the spatial-temporal representation into a latent space. Our LADN not only performs the same operation, but also projects the global, temporal, and local representations into another three latent spaces. Furthermore, LADN averages the four similarities in these four latent spaces, which helps improve the retrieval performance.

Table 2 presents the model complexity of Dual Encoding and our LADN. Compared with Dual Encoding, LADN has a higher computational complexity. This is because LADN utilizes four latent spaces, while Dual Encoding only leverages one; projecting representations into another three latent spaces requires additional computation. However, the text-video retrieval performance of LADN is better than that of Dual Encoding.

Figures 3, 4, and 5 show text-to-video retrieval results of LADN and Dual Encoding on the MSR-VTT B partition [18]. In Fig. 3, LADN ranks the corresponding video in the 1st place while Dual Encoding fails, which illustrates the superiority of our method. The queries in Figs. 4 and 5 remain problematic for both LADN and Dual Encoding. For the results in Fig. 4, although the two methods obtain the right concept "paper," they cannot find the intrinsic relationship between the sentence and the corresponding video. A possible reason is that the dataset contains only a small number of videos about a "typewriter." For the results in Fig. 5, although the top-1 retrieved result is incorrect, its semantics are consistent with those of the ground truth for both LADN and Dual Encoding.

Experiments on TRECVID AVS 2016-2018
We cite the top 3 results of the TRECVID AVS task for each year: [53,55,57] in 2016, [39,40,58] in 2017, and [54,56,59] in 2018. Additionally, we cite results from [16,60] and [37]. The other results are cited from [22]. Table 4 shows the experimental results, where the overall performance is the average score over the three years. Our proposed method LADN achieves the best performance, which demonstrates that LADN can effectively perform large-scale video retrieval with text queries.

Ablation study
We design several variants of LADN to verify the effectiveness of each of its components. The LADN(w/o g, t, l alignments) variant is constructed by removing the global, temporal, and local alignments. The LADN(w/o semantic space) variant is built by removing the semantic space. The LADN(w/ g, t, l semantic spaces) variant maps the global, temporal, and local information into three additional semantic spaces, respectively. We remove the alignment in the global, local, temporal, or spatial-temporal space to construct LADN(w/o g alignment), LADN(w/o l alignment), LADN(w/o t alignment), and LADN(w/o s_t alignment), respectively. Table 5 presents the experimental results on the MSR-VTT B partition [18]. Compared with LADN, LADN(w/o g, t, l alignments) yields the worst performance, which proves the effectiveness of the level-wise aligned mechanism: LADN makes full use of global, temporal, and local information to further improve the text-video retrieval performance. By comparing LADN and LADN(w/o semantic space), we conclude that the semantic space plays a vital role in improving retrieval performance. Compared with LADN, LADN(w/ g, t, l semantic spaces) utilizes more semantic spaces but cannot further improve the retrieval performance. Removing the alignment in any of the four spaces, i.e., the global, temporal, local, or spatial-temporal space, degrades performance, which demonstrates that the four latent spaces are complementary to each other.

Conclusion
This paper proposes a method named level-wise aligned dual networks (LADN) for text-video retrieval. LADN first utilizes multi-level encoders to extract global, local, temporal, and spatial-temporal information from videos and sentences. Then, they are mapped into four different latent spaces and one semantic space. Finally, LADN combines the similarities of the four latent spaces and the one semantic concept space to improve cross-modal retrieval performance and increase interpretability. Extensive experiments conducted on three widely used datasets, including MSR-VTT, VATEX, and TRECVID AVS 2016-2018, demonstrate that our proposed approach is superior to several state-of-the-art approaches.

Fig. 3 Text-to-video retrieval results of LADN and Dual Encoding on the MSR-VTT B partition [18]. The top 4 ranked videos are shown for each query; the ground truth is marked with a red box and the others with green boxes. The last column shows the predicted concepts corresponding to the second column.

Fig. 4 Text-to-video retrieval results of LADN and Dual Encoding on the MSR-VTT B partition [18]. The top 3 ranked videos and the ground truth are shown for each query; the ground truth is marked with a red box and the others with green boxes. The last column shows the predicted concepts corresponding to the second column.

Fig. 5 Text-to-video retrieval results of LADN and Dual Encoding on the MSR-VTT B partition [18]. The top 4 ranked videos are shown for each query; the ground truth is marked with a red box and the others with green boxes. The last column shows the predicted concepts corresponding to the second column.