Multi-task learning for abstractive text summarization with key information guide network

Neural networks based on the attentional encoder-decoder model perform well on abstractive text summarization. However, these models are difficult to control during generation, which often leads to summaries that miss key information. Key information such as time, place, and people is indispensable for humans to understand the main content. In this paper, we propose a key information guide network for abstractive text summarization based on a multi-task learning framework. The core idea is to automatically extract, in an end-to-end way, the key information that people need most and use it to guide the generation process, so as to obtain a more human-compliant summary. In our model, the document is encoded into two parts: the output of the normal document encoder and the key information encoding, where the key information includes key sentences and keywords. A multi-task learning framework is introduced to obtain a more sophisticated end-to-end model. To fuse the key information, we propose a novel multi-view attention guide network that produces dynamic representations of the source text and the key information. These dynamic representations are then incorporated into the abstractive module to guide summary generation. We evaluate our model on the CNN/Daily Mail dataset, and experimental results show that it leads to significant improvements.


Introduction
Text summarization is the task of automatically generating a brief summary from a given text while maintaining its key information. There are two main approaches to text summarization: extractive and abstractive. Extractive models [1,2] generally obtain a summary by extracting a few sentences from the original text, while abstractive models [3,4] produce a summary by generating new sentences. Recently, the neural encoder-decoder framework [5] has inspired research on abstractive text summarization. The language generated by such models is generally more fluent, and the encoder-decoder framework is also convenient for adjusting parameters automatically. (*Correspondence: xuweiran@bupt.edu.cn; 1 Beijing University of Posts and Telecommunications, Xitucheng Road, Beijing 100876, China. Full list of author information is available at the end of the article.)
Both the original text and the summary are human language. To generate a higher-quality result, the model must be able to "understand" and represent the original text in a human-like manner. Entities such as time, place, and person are key for humans to understand the main content, so it is essential to include this key information in the summary. Although current abstractive models have proved capable of capturing the regularities of text summarization, they are difficult to control during generation. In other words, without external guidance, it is hard to ensure that these abstractive models identify key information and include it in the output [6].
Some studies have tried to solve these problems. Zhou et al. [7] proposed a selective gate network to retain more key information in the summary. However, the selective gate network, which is controlled by the representation of the input text, filters the information flow from encoder to decoder only once; if some key information does not pass the gate, it can hardly appear in the summary. See et al. [8] proposed a pointer-generator model, which uses the pointer mechanism [9] to copy words from the input text, to deal with out-of-vocabulary (OOV) words. Without external guidance, however, it is hard for the pointer to identify keywords. In previous work, we combined an extractive model with an abstractive model, using the former to obtain keywords as guidance for the latter [10]. However, that model is not sophisticated enough: it is a pipelined system that extracts keywords with the TextRank algorithm.
In addition, the decoder, especially the attention mechanism, is also crucial for generating a summary that contains all the key information. Hsu et al. [11] proposed a unified model that combines sentence-level and word-level attentions by simple scalar multiplication and renormalization. Their sentence-level attention is fixed, meaning it is the same for every generated word; however, the focus of the abstractive model on sentences should change constantly during summary generation. Tan et al. [6] proposed a graph-based attention method that can discover the salient information of a document within the encoder-decoder framework. They also use the TextRank algorithm to identify key sentences in the input text. We argue that an abstractive summarization model should itself perform key information extraction and then use the key information to dynamically guide the generation process.
In this paper, we propose a key information guide network for abstractive text summarization based on a multi-task learning framework. Our core idea is to automatically extract, in an end-to-end way, the key information that people need most and use it to guide the generation process, so as to obtain a more human-compliant summary. In our model, the document is encoded into two parts: the output of the document encoder and the key information encoding, where the key information includes key sentences and keywords. A multi-task learning framework is introduced to obtain a more sophisticated end-to-end model. The main task is an abstractive model based on the encoder-decoder structure, in which a normal document encoder is employed. The second task is key sentence extraction, for which an extractive method is included. Another extractive model is used to extract keywords. The extractive models and the abstractive model are trained jointly in this multi-task framework so that they can benefit each other. In the encoder, several semantic coding layers are naturally formed: the word layer, the sentence layer, and the document layer. To simplify the model, the key information, i.e., key sentences and keywords, is extracted from the outputs of the sentence-layer encoder. In the decoder, the key information guides the generation process in two ways: through the attention mechanism and through the pointer mechanism. To fuse the key information, we propose a novel multi-view attention guide network that produces dynamic representations of the source text and the key information, which are incorporated into the abstractive module to guide summary generation. Experiments show that our model achieves significant improvements.
Our contributions are as follows:

• We propose a key information guide network model to obtain more human-compliant summaries. In this model, a document is represented as a keyword encoding, a key sentence encoding, and a document encoding, which work together to guide the generation of the summary.

• A multi-task learning framework is introduced to obtain a more sophisticated end-to-end model. The extractive models and the abstractive model are fused into one end-to-end model, which can make full use of various training data to adjust the model parameters.

• In the decoder, we propose a novel multi-view attention guide network to obtain dynamic representations of the source text and the key information, which are incorporated into the abstractive module to guide summary generation.
The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 introduces the key information guide network, which serves as our baseline. Section 4 presents our model in detail, followed by experiments and analysis in Section 5. Finally, we conclude in Section 6.

Abstractive summarization
Since Rush et al. [3] first introduced the encoder-decoder framework to the task of text summarization, abstractive models [6,10,12] have been widely used to generate human-like summaries. Hsu et al. [11] propose a unified model combining sentence-level and word-level attentions to take advantage of both extractive and abstractive summarization approaches. Most methods select content at the sentence level [11] or the word level [10], and current state-of-the-art models use pointer-generator mechanisms. Our model selects key information, including keywords and key sentences, and then incorporates it into the abstractive module to guide summary generation.

Extractive summarization
Yasunaga et al. [2] use a graph-based neural model to extract salient sentences. Nallapati et al. [4] use recurrent neural networks to read the article, obtain representations of the sentences and the article, and select sentences accordingly. Most models utilize side information (e.g., image captions and titles) to help the sentence classifier choose sentences. In addition, Tan et al. [6] combined recurrent neural networks with graph convolutional networks to compute the salience (importance) of each sentence. Neural models are able to leverage large-scale corpora and achieve better performance than traditional methods. However, even extractive models that achieve strong performance suffer from low relevance between the selected sentences.

Content selection
Since text summarization requires key information selection, abstractive models first select key information and then generate the summary based on it. Some recent work addresses sentence extraction and document modeling in an end-to-end framework. Nallapati et al. [4] propose an encoder-decoder approach in which the encoder hierarchically learns representations of sentences and documents while an attention-based sentence extractor sequentially extracts salient sentences from the original document. Building on that work, they propose a recurrent neural network-based sequence-to-sequence model for sequential labeling of each sentence in the document. However, some key information may be lost in the selection process, and the generation stage cannot recover it.

Pointer-generator network
The pointer network [9] is a sequence-to-sequence model that uses a soft attention distribution to produce an output sequence consisting of elements of the input sequence. Pointer networks have been used to create hybrid approaches for neural machine translation (NMT), language modeling, and abstractive summarization, and this line of work is close to forced-attention sentence compression and the CopyNet model. The pointer-generator model [8] differs in a few respects: it computes an explicit switching probability to decide whether to generate a word from the vocabulary or copy one from the input text, and it reuses the attention distribution as the copy distribution. When a word appears multiple times in the source text, the probability mass of all corresponding positions of the attention distribution is summed. The authors argue that computing an explicit switching probability is beneficial: it rescales the probability of all generated words or all copied words at once rather than adjusting them separately, and they find that reusing the attention distribution is simpler than learning a separate copy distribution yet sufficient for the same purpose. Experimentally, they observe that the pointer mechanism often copies a word by attending to its multiple occurrences in the source text.
The pointer-generator model also differs from other pointer-network approaches: those works train their pointer components to activate only for out-of-vocabulary words or named entities, and they do not mix the copy distribution with the vocabulary distribution. The hybrid approach of the pointer-generator model is more suitable for abstractive summarization, and its authors demonstrate that the copy mechanism is critical for accurately reproducing rare but in-vocabulary words.

Prediction-guide mechanism
He et al. [13] propose a prediction network that predicts the long-term value of the final generated output. Specifically, they use this network to improve beam search: it takes the source sentence, the currently decoded prefix, and the candidate word at step t as inputs and predicts the long-term value (e.g., the BLEU score) of the partial target sentence if it were completed by the NMT (neural machine translation) model. Following the terminology of reinforcement learning, they call it a value network, give it a recurrent structure, and train its parameters on bilingual data. At test time, when selecting a word w for decoding, they consider both the conditional probability given by the NMT model and the long-term value predicted by the value network. Our prediction-guide mechanism is used to ensure that more key information is covered in the final summary.

Multitask learning and joint training
Multi-task learning and joint training are also central topics for this paper. However, our algorithm is constructed mainly with reference to [14-22] and is not claimed as a new model in itself.

The key information guide network
In this section, we introduce the key information guide network (KIGN) [10] (Fig. 1), which serves as our baseline.

Encoder-decoder model based attention
Fig. 1 The key information guide network. We encode the keywords into the key information representation [10]

Our encoder-decoder framework is similar to that of [4]. The tokens of the input article x = {w_1, w_2, ..., w_n} are fed into the encoder, which maps the text into the sequence of encoder hidden states {h_1, h_2, ..., h_n}. At each decoding time step t, the previous word embedding w_{t-1} and the previous context vector c_{t-1} are taken as input to obtain the decoder hidden state s_t. The context vector c_t is obtained by the attention mechanism:

e_ti = v^T tanh(W_h h_i + W_s s_t)   (1)
α_t = softmax(e_t)   (2)
c_t = Σ_i α_ti h_i   (3)

where v, W_h, and W_s are learnable parameters and h_i is the hidden state of input token w_i. The context vector c_t, which represents what has been read from the source text, is concatenated with the decoder hidden state s_t to predict the next word with a softmax layer over the whole vocabulary:

P_v(y_t | y_1, ..., y_{t-1}) = softmax(f(s_t, c_t))   (4)

where f represents a linear function or a neural network.
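As a sanity check, the additive attention and context-vector computation described above can be sketched in plain Python. The toy dimensions and hand-picked weights below are illustrative, not the paper's actual parameters; `attention`, `W_h`, `W_s`, and `v` mirror the symbols in the text.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def attention(hs, s_t, W_h, W_s, v):
    """Additive attention: score each encoder state h_i against the
    decoder state s_t, normalize, and mix the states into a context vector."""
    scores = []
    for h_i in hs:
        z = [a + b for a, b in zip(matvec(W_h, h_i), matvec(W_s, s_t))]
        scores.append(dot(v, [math.tanh(x) for x in z]))
    alpha = softmax(scores)                     # attention distribution
    c_t = [sum(a * h[d] for a, h in zip(alpha, hs))
           for d in range(len(hs[0]))]          # context vector
    return alpha, c_t

# toy run: 3 encoder states of size 2, one decoder state
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [0.5, -0.5]
W_h = [[1.0, 0.0], [0.0, 1.0]]
W_s = [[0.2, 0.0], [0.0, 0.2]]
v = [1.0, 1.0]
alpha, c_t = attention(hs, s_t, W_h, W_s, v)
```

The attention weights form a proper distribution over source positions, and the context vector lives in the same space as the encoder states.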

Key information guide network
Most encoder-decoder models [7,8] simply take the source text as input and output a summary; the generation process is hard to control, which can result in missing key information. We propose a key information guide network that guides generation through the attention mechanism and the pointer mechanism. First, keywords are extracted with the TextRank algorithm. Then, a max-pooling CNN model [23,24] is employed to merge the key information. As shown in Fig. 1, the keywords are fed one by one into the key information guide network, and the last forward hidden state →h_n and the first backward hidden state ←h_1 are concatenated as the key information representation k:

k = [→h_n ; ←h_1]   (5)

The traditional attention mechanism, which uses only the decoder state as the query over the encoder hidden states, has difficulty identifying keywords. We therefore use the key information representation k as an extra input to the attention mechanism, changing Eq. (1) to:

e_ti = v^T tanh(W_h h_i + W_s s_t + W_k k)   (6)

where W_k is a learnable parameter. The new e_ti yields a new attention distribution α^e_t (Eq. 2) and a new context vector c_t (Eq. 3). We then use k and the new context vector c_t to calculate the probability distribution over the vocabulary, changing Eq. (4) to:

P_v(y_t | y_1, ..., y_{t-1}) = softmax(f(s_t, c_t, k))   (7)

where the subscript v indicates that y_t is drawn from the target vocabulary. The KIGN makes the attention mechanism focus more on the keywords, which is similar to introducing prior knowledge into the model.

Pointer mechanism
To deal with the OOV (out-of-vocabulary) problem, we combine the pointer network [9] with our key information-based generation, which enables us to copy words from the source as well as generate them. In the pointer-generator model, we compute a soft switch p_sw to select between generated words and copied words:

p_sw = σ(w_k^T k + w_c^T c_t + w_s^T s_t + b_sw)   (8)

where w_k^T, w_c^T, w_s^T, and b_sw are learnable parameters and σ is the sigmoid function.
Our pointer mechanism is equipped with the key information representation, which helps it identify keywords. We use the new attention weight α^e_ti as the copy probability of input token w_i and obtain the following probability distribution to predict the next word:

P(y_t = w) = p_sw P_v(y_t = w) + (1 − p_sw) Σ_{i: w_i = w} α^e_ti   (9)

Note that if w is an OOV word, P_v(y_t = w) is zero.
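The mixture of the generation and copy distributions can be illustrated with a small, self-contained sketch. The vocabulary, source tokens, attention weights, and switch input below are made up for illustration; a real model computes them from learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def final_distribution(p_vocab, alpha, src_tokens, p_sw):
    """P(w) = p_sw * P_vocab(w) + (1 - p_sw) * sum over positions i with
    w_i = w of alpha_i.  An OOV word has P_vocab(w) = 0 but can still be
    copied from the source via the attention distribution."""
    final = {w: p_sw * p for w, p in p_vocab.items()}
    for a, w in zip(alpha, src_tokens):
        final[w] = final.get(w, 0.0) + (1.0 - p_sw) * a
    return final

# toy example: "felix" is OOV for the generator but present in the source
p_vocab = {"the": 0.5, "cat": 0.5}      # generation distribution
src_tokens = ["felix", "cat"]
alpha = [0.7, 0.3]                      # attention (copy) distribution
p_sw = sigmoid(0.4)                     # soft switch between generate/copy
dist = final_distribution(p_vocab, alpha, src_tokens, p_sw)
```

Because both input distributions sum to one, the mixture is again a valid distribution, and the OOV token "felix" receives nonzero probability purely through copying.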
During training, we minimize the maximum-likelihood loss at each decoding time step t, the most widely used objective for generation tasks. Defining y*_t as the target word at decoding time step t, the overall loss is:

L = −(1/T) Σ_{t=1}^{T} log P(y*_t | y*_1, ..., y*_{t−1}, x)   (10)

Multi-task learning for abstractive text summarization with KIGN
The KIGN introduced in Section 3 enables the summary generator to pay more attention to the key information that humans care about most. However, that model is not sophisticated enough; for example, key information is obtained through the TextRank algorithm rather than a learning-based approach. In this section, we propose a multi-task learning model (Fig. 2) for abstractive text summarization based on the KIGN framework, which includes a novel document encoder, a key information extractor, and techniques such as joint training and a prediction-guide mechanism. As shown in Fig. 2, we first propose a document encoder that consists of a word-level encoder and a sentence-level encoder, which yield global features for each word and an encoding for each sentence, respectively. Then, the key information extraction layer selects keywords and key sentences. Next, a multi-source attention guides the generation process. Finally, the abstracter and the key information extractors, which include a keyword extractor and a key sentence extractor, are trained by minimizing three loss functions in an end-to-end manner.

Document encoder
We propose a novel document encoder that encodes words and sentences separately, instead of using only a hierarchical encoder for both [4]. This facilitates the subsequent keyword extraction and key sentence extraction.

The global word encoder
The task of keyword extraction requires information about the whole document as well as about each single word. To encode global information into the hidden state representing a word, the encoding of each word must pass through at least one layer of neural network, so we use a bidirectional LSTM as the global word encoder. The tokens of the input text {w_1, w_2, ..., w_n} are fed forward into the global word encoder, which maps the text into a sequence of hidden states:

→h^w_i = LSTM(→h^w_{i−1}, e^w_i)

where →h^w_i and e^w_i denote the forward hidden state and the embedding of word w_i, respectively, for i = 1, ..., n. Likewise, the backward hidden states ←h^w_i, i = 1, ..., n, can be obtained. Therefore, the encoding of the ith word is:

h^w_i = [→h^w_i ; ←h^w_i]

We concatenate the forward and backward final states of the global word encoder to obtain the representation of the document:

d^w = [→h^w_n ; ←h^w_1]

The sentence encoder
The sentence encoder is a hierarchical encoder consisting of a word-level encoder and a sentence-level encoder. A document can be seen as a sequence of words {w_1, w_2, ..., w_n}, or as a sequence of sentences {s_1, s_2, ..., s_m} with s_i = {w_{i,1}, w_{i,2}, ...}. As shown in Fig. 2, the word-level encoder is a bidirectional LSTM that encodes the words of each sentence:

h^w_{i,j} = BiLSTM(h^w_{i,j−1}, e^w_{i,j})

where h^w_{i,j} and e^w_{i,j} denote the hidden state and the embedding of word w_{i,j}, respectively. We concatenate the forward and backward final hidden states of the word-level encoder as the sentence representation:

s_i = [→h^w_{i,|s_i|} ; ←h^w_{i,1}]

Then, we use another bidirectional LSTM as the sentence-level encoder, which updates each sentence representation:

h^s_i = BiLSTM(h^s_{i−1}, s_i)

where h^s_i denotes the hidden state of sentence s_i. The representation of the document based on the sentence encoder is:

d^s = [→h^s_m ; ←h^s_1]
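The shape flow of this hierarchical encoder can be sketched with a toy bidirectional RNN. A vanilla tanh RNN cell stands in for the paper's LSTM cells, and all weights and dimensions below are illustrative assumptions, not the actual model.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn(xs, W, U, h0):
    """Vanilla RNN (stand-in for an LSTM): h_t = tanh(W x_t + U h_{t-1})."""
    h, out = h0, []
    for x in xs:
        h = [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, h))]
        out.append(h)
    return out

def birnn(xs, W, U, h0):
    fwd = rnn(xs, W, U, h0)
    bwd = rnn(xs[::-1], W, U, h0)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]   # concat fwd/bwd per position

def encode_document(sents, W_w, U_w, W_s, U_s, dim):
    h0 = [0.0] * dim
    sent_reps = []
    for sent in sents:
        states = birnn(sent, W_w, U_w, h0)
        # forward final state (last position) + backward final state (first)
        sent_reps.append(states[-1][:dim] + states[0][dim:])
    # second BiRNN over sentence representations -> sentence states h^s_i
    return birnn(sent_reps, W_s, U_s, h0)

# toy document: 2 sentences, word embeddings of size 2, hidden size 2
doc = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
W_w = [[0.5, 0.0], [0.0, 0.5]]
U_w = [[0.1, 0.0], [0.0, 0.1]]
W_s = [[0.2, 0.2, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1]]   # maps 2*dim -> dim
U_s = [[0.1, 0.0], [0.0, 0.1]]
h_s = encode_document(doc, W_w, U_w, W_s, U_s, dim=2)
```

Each sentence contributes one representation built from the final forward and backward word-level states, and the sentence-level pass produces one bidirectional state per sentence.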

Key information extraction
The task of text summarization needs to remove unnecessary information and retain the key information of the input document. Since it is difficult for the encoder-decoder framework to find keywords and key sentences on its own, we employ a key information extraction model to extract them. The key information extractors can be trained by supervised learning within the multi-task framework.

Keyword extraction
First, we use the outputs of the global word encoder, i.e., h^w_i (i = 1, ..., n) and d^w, to extract keywords within the sequence-to-sequence model:

p^w_i = σ(W_g h^w_i + U_g d^w + b_g)

where W_g and U_g denote weight parameters, b_g the bias vector, and σ the sigmoid activation function. Then, the top N words {w^k_1, ..., w^k_N} by probability are taken as the keywords.

Key sentence extraction
Similarly, the encodings from the sentence encoder, i.e., h^s_i (i = 1, ..., m) and d^s, are used to extract key sentences. For each sentence s_i, the gate network takes d^s and h^s_i as inputs to calculate a probability value:

p^s_i = σ(W_s h^s_i + U_s d^s + b_s)

where W_s and U_s denote weight matrices, b_s the bias vector, and σ the sigmoid activation function. The top M sentences {s^k_1, ..., s^k_M} by probability value are taken as the key sentences.
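Both gates share the same form, a per-unit sigmoid score followed by a top-N selection, which can be sketched as follows. Vector-valued `W` and `U` are used here because the gate outputs a single probability per unit; all numbers are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def extract_top(units, doc_rep, W, U, b, top_n):
    """Score each encoder state h_i with sigma(W.h_i + U.d + b) and keep
    the top_n highest-scoring positions (returned in document order)."""
    base = sum(u * d for u, d in zip(U, doc_rep)) + b   # document term U.d + b
    scores = [sigmoid(sum(w * h for w, h in zip(W, h_i)) + base)
              for h_i in units]
    order = sorted(range(len(units)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:top_n]), scores

# toy run: pick the top 2 of 4 word states as "keywords"
word_states = [[1.0, 1.0], [0.0, 0.0], [0.9, 0.8], [-1.0, -1.0]]
d_w = [0.5, 0.5]
picked, scores = extract_top(word_states, d_w,
                             W=[1.0, 1.0], U=[0.2, 0.2], b=0.0, top_n=2)
```

The same routine applied to sentence states and d^s yields the key sentence selection.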

Improved key information guide network
Based on the above methods, the KIGN introduced in Section 3 can be improved. The first improvement is to update the representation of key information. Two kinds of key information are now available: keywords and key sentences. Following Section 3, we obtain a keyword representation k_w and a key sentence representation k_s, which together substitute for k.
The second improvement is to update the document representation. In Section 4.1, the document is encoded into the hidden states of words, i.e., {h^w_i | i = 1, ..., n}, and the hidden states of sentences, i.e., {h^s_i | i = 1, ..., m}. By replacing h_i in Eqs. (1)-(3) with h^w_i, the word encoder-based context vector c^w_t is obtained; in the same way, the sentence encoder-based context vector c^s_t can also be obtained. As a result, Eq. (7) is updated to:

P_v(y_t | y_1, ..., y_{t−1}) = softmax(f(s_t, c^w_t, c^s_t, k_w, k_s))   (22)

In addition, some further parts, such as Eq. (8), must be updated accordingly; we do not repeat them here.

Joint training the model
We jointly train the abstracter, the keyword extractor, and the key sentence extractor by minimizing three loss functions: L_abs, L_kw, and L_ks. The final loss is:

L = λ_1 L_abs + λ_2 L_kw + λ_3 L_ks

where λ_1, λ_2, and λ_3 are hyper-parameters. Similar to [8], we use the pointer mechanism to calculate the final word distribution P_final(ŷ_t) and train the abstracter by minimizing the negative log-likelihood:

L_abs = −Σ_t log P_final(ŷ_t)

where ŷ_t is the tth token in the reference summary. The losses of keyword extraction L_kw and key sentence extraction L_ks are binary cross-entropies:

L_kw = −Σ_i [r^w_i log p^w_i + (1 − r^w_i) log(1 − p^w_i)]
L_ks = −Σ_i [r^s_i log p^s_i + (1 − r^s_i) log(1 − p^s_i)]

where the numbers of selected keywords N and key sentences M are hyper-parameters. Ground truth labels. To obtain the keyword labels, we filter the stop words out of the reference summary and use the remaining words as keyword ground truth labels r^w = {r^w_i}_i. For the key sentence ground truth labels r^s = {r^s_i}_i, we measure the informativeness of each sentence in the text and select key sentences similarly to [11]: we first compute the ROUGE-L recall score between each sentence in the article and the reference summary, then sort the sentences by this score and add them one at a time, from most to least informative, keeping a sentence only if it increases the informativeness of the selected set. Finally, we use the resulting ground truth labels to train the extractors by minimizing the loss functions above. Our method differs from [11], which aims to extract the final summary of the article and therefore uses the ROUGE F-1 score to select ground truth sentences; we instead use the ROUGE recall score so as to cover as much of the reference summary's information as possible.
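The greedy, recall-driven construction of key sentence labels can be sketched directly. `rouge_l_recall` below is a bare-bones LCS-based stand-in for a full ROUGE-L implementation (no stemming or tokenization refinements), and the sentences are toy token lists.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence (the basis of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y \
                else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(candidate, reference):
    return lcs_len(candidate, reference) / max(len(reference), 1)

def greedy_sentence_labels(sentences, reference):
    """Visit sentences from highest to lowest individual recall; keep a
    sentence only if it raises the recall of the concatenated selection."""
    order = sorted(range(len(sentences)),
                   key=lambda i: rouge_l_recall(sentences[i], reference),
                   reverse=True)
    selected, chosen_words, best = [], [], 0.0
    for i in order:
        cand = chosen_words + sentences[i]
        score = rouge_l_recall(cand, reference)
        if score > best:
            selected.append(i)
            chosen_words, best = cand, score
    return sorted(selected)

# toy article of 3 sentences against a 4-token reference summary
sents = [["a", "b"], ["x", "y"], ["c", "d"]]
reference = ["a", "b", "c", "d"]
labels = greedy_sentence_labels(sents, reference)
```

Sentence 1 adds nothing to the reference's coverage, so only sentences 0 and 2 receive positive labels.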

Prediction-guide mechanism at test time
At test time, when the model predicts the next word, we consider not only the probability in Eq. (9) but also the long-term value predicted by the prediction-guide mechanism, which is based on [13].
Our prediction-guide mechanism is a single-layer feedforward network with a sigmoid activation that predicts how much of the key information will be covered in the final summary. At each decoding time step t, it takes the mean of the decoder hidden states, s̄_t = (1/t) Σ_{l=1}^{t} s_l, the mean of the encoder states, h̄_n = (1/n) Σ_{i=1}^{n} h_i, and the key information representation k as inputs to obtain the long-term value.
To train it, for each input x we sample two partial summaries y_p1 and y_p2 by decoding and stopping at random steps. Then, from each partial summary y_p we complete the generation (using beam search) to obtain a set S(y_p) of M completed summaries and calculate the average score:

AvgCos(x, y_p) = (1/M) Σ_{y ∈ S(y_p)} cos(k, s̄_y)

where cos is the cosine similarity and s̄_y is the average decoder hidden state of the completed summary y. We want the predicted value v(x, y_p1) to be larger than v(x, y_p2) whenever AvgCos(x, y_p1) > AvgCos(x, y_p2). Therefore, the loss function of the prediction-guide network is:

L_pg = exp(v(x, y_p2) − v(x, y_p1))

where AvgCos(x, y_p1) > AvgCos(x, y_p2).
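A minimal sketch of this training signal follows, under two stated assumptions: the average score is a cosine similarity between the key information vector and each completed summary's representation, and the pairwise loss takes the exponential ranking form used by He et al. [13]. All vectors and values are toy inputs.

```python
import math

def cos(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def avg_cos(key_rep, completed_reps):
    """Average cosine similarity between the key-information vector and the
    representations of the M completed summaries in S(y_p)."""
    return sum(cos(key_rep, r) for r in completed_reps) / len(completed_reps)

def ranking_loss(v_better, v_worse):
    # push the predicted value of the better partial summary above the worse
    return math.exp(v_worse - v_better)

# toy check: two completed summaries, one aligned with k, one orthogonal
pair_quality = avg_cos([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
loss_good = ranking_loss(1.0, 0.0)   # well-ordered pair -> small loss
loss_bad = ranking_loss(0.0, 1.0)    # mis-ordered pair -> large loss
```

The loss shrinks as the value network ranks the higher-coverage partial summary above the lower-coverage one.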
At test time, we first calculate the length-normalized log probability of each candidate and then linearly combine it with the value predicted by the prediction-guide network. Given the abstractive model P(y|x) (Eq. 9), the prediction-guide network v(x, y), and a hyperparameter α ∈ (0, 1), the score of a candidate sequence y for input x is calculated by:

score(x, y) = α × (1/|y|) log P(y|x) + (1 − α) × log v(x, y)
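The reranking at test time can be sketched as follows. Whether the value enters the combination as log v(x, y) follows our reading of [13] and should be treated as an assumption; the two candidates are toy numbers chosen to show how α trades probability against predicted coverage.

```python
import math

def combined_score(log_prob, length, value, alpha):
    """alpha * (log P(y|x) / |y|)  +  (1 - alpha) * log v(x, y)."""
    return alpha * (log_prob / length) + (1.0 - alpha) * math.log(value)

# two toy candidates: (total log P, length, predicted long-term value)
c1 = (-4.0, 4, 0.9)   # lower probability, high predicted key-info coverage
c2 = (-3.0, 4, 0.2)   # higher probability, low predicted coverage
best_prob_heavy = max([c1, c2], key=lambda c: combined_score(*c, alpha=0.9))
best_value_heavy = max([c1, c2], key=lambda c: combined_score(*c, alpha=0.5))
```

With α close to 1 the model probability dominates and the likelier candidate wins; lowering α lets the prediction-guide value flip the ranking toward the candidate expected to cover more key information.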

Experiment setup
The CNN/Daily Mail dataset [4,25] is used, and the data is processed in the same way as [8]. We use three 300-dimensional LSTMs for the global word encoder and the sentence encoder, and a 50k-word vocabulary. During training and testing, we truncate the input to 400 tokens for the word encoder and limit the length of the summary to 100 tokens. We train the model using Adagrad [15] with learning rate 0.15 and an initial accumulator value of 0.1. The batch size is 16, and the numbers of keywords and key sentences are 40 and 10, respectively. We jointly train the three tasks with λ_1 = 1 and λ_2 = λ_3 = 0.5. Following previous work, our main evaluation metric is the F-score of ROUGE. For the prediction-guide mechanism, we use a single-layer feedforward network with 800 nodes, set M to 8, use mini-batch training with a batch size of 16, and train this network with AdaDelta. To choose the hyperparameter α, we test the performance of the KIGN+Prediction-guide model at decoding time with different values of α. As can be seen from Fig. 3, the performance of our model is stable for α ranging from 0.8 to 0.95, and setting α to 0.9 gives the highest F-score.
During training, we truncate the input to 400 tokens and limit the output summary to 100 tokens; at test time, similar to [8], we truncate the input to 400 tokens and allow output summaries of up to 120 tokens. We train the keyword network model for fewer than 200,000 iterations. Then, we train the single-layer feedforward network on top of the KIGN model. Finally, at test time, we combine the KIGN model with the prediction-guide mechanism to generate summaries.

Results and discussion
The experimental results are shown in Table 1. The first five rows are commonly used sequence-to-sequence methods: a Seq2Seq model with attention and a 150k vocabulary, a Seq2Seq model with attention and a 50k vocabulary, a Seq2Seq model with graph attention, the hierarchical attention networks method [4], and a Seq2Seq model equipped with the pointer mechanism. Table 1 shows that our baseline model, the key information guide network (Fig. 1), obtains better scores than the Seq2Seq model with the pointer mechanism by +1.3 ROUGE-1, +0.9 ROUGE-2, and +1.0 ROUGE-L. With the help of the prediction-guide mechanism, KIGN+Prediction-guide achieves much better results, +2.5 ROUGE-1, +1.5 ROUGE-2, and +2.2 ROUGE-L, while the key information guide network with the multi-task learning framework (Fig. 2) obtains the best scores, with additional improvements of +0.2 ROUGE-1, +0.2 ROUGE-2, and +0.2 ROUGE-L. Table 1 also shows that if the keywords and key sentences are given, the results improve further (40.34 ROUGE-1, 17.70 ROUGE-2, 36.57 ROUGE-L), which confirms that using key information to guide the generation of text summaries is reasonable.

Case study
To demonstrate our method's ability to capture key information, Fig. 4 shows the processing result for a specific piece of text. The original text is listed in the top half of Fig. 4, with the key information identified in bold; below it are the gold summary and the outputs of the two models. The original text is about Google handwriting input on Android handsets and introduces some of its functions; the key information is "google claims," "read anyone's handwriting," "android handsets can understand 82 languages in 20 distinct scripts," and "works with both printed and cursive writing input with or without a stylus." The summary of the baseline model equipped with the pointer mechanism covers only "google have cracked the problem of reading handwriting," while our model covers almost all the key information.

Ablation studies
We conduct ablation experiments to show the effect of our multi-view attention guide network. As shown in Table 2, every view contributes some performance boost to the model, with the keyword-view attention and guided generation contributing the most. We also observe that key information extraction improves the ROUGE scores (+1.1 ROUGE-1, +0.6 ROUGE-2, +0.9 ROUGE-L).

Conclusions
In this paper, we propose a multi-task learning model with a key information guide network that combines the extractive and abstractive methods in a novel way. Our model is based on the key information guide network, which uses an extractive method to obtain keywords from the text, encodes them into a key information representation, and integrates this representation into the abstractive model to guide generation, mainly through the attention mechanism and the pointer mechanism. On top of the key information guide network, we propose a multi-task learning model to jointly train the extractive and abstractive models. Specifically, we use a document encoder to encode the words and sentences of the input text separately, and then extract key information, including keywords and key sentences, based on these encodings; in this way, the key information comes from our sequence-to-sequence model rather than from the TextRank method. Finally, we jointly train the three tasks of keyword extraction, key sentence extraction, and summary generation. At test time, we use a prediction-guide mechanism, which estimates the long-term value of future decoding, to further guide summary generation. Experiments show that our model leads to significant improvements.