Before describing the proposed model, we review the previous research by Nie et al. [13] on combining the bigram and PLSA models. Their method is a special case (under certain independence assumptions) of our proposed method.
3.1. Nie et al.'s Bigram-PLSA Model
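Nie et al. presented a combination of the bigram and PLSA models [13]. Instead of $P(w_j \mid z_l)$ in (1), their bigram-PLSA model employs $P(w_j \mid w_i, z_l)$, resulting in
$$P(w_j \mid w_i, d_k) = \sum_{l=1}^{K} P(w_j \mid w_i, z_l)\, P(z_l \mid d_k). \tag{3}$$
The EM procedure for training the combined model contains the following two steps.
E-step:
$$P(z_l \mid w_i, w_j, d_k) = \frac{P(w_j \mid w_i, z_l)\, P(z_l \mid d_k)}{\sum_{l'=1}^{K} P(w_j \mid w_i, z_{l'})\, P(z_{l'} \mid d_k)}. \tag{4}$$
M-step:
$$P(w_j \mid w_i, z_l) = \frac{\sum_{k} n(d_k, w_i, w_j)\, P(z_l \mid w_i, w_j, d_k)}{\sum_{j'} \sum_{k} n(d_k, w_i, w_{j'})\, P(z_l \mid w_i, w_{j'}, d_k)}, \qquad P(z_l \mid d_k) = \frac{\sum_{i} \sum_{j} n(d_k, w_i, w_j)\, P(z_l \mid w_i, w_j, d_k)}{n(d_k)}, \tag{5}$$
where $n(d_k, w_i, w_j)$ is the number of times that the word pair $w_i w_j$ occurs in the document $d_k$, and $n(d_k)$ is the number of words in the document $d_k$.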
3.2. Proposed Bigram-PLSA Model
We intend to combine the bigram and PLSA models to take advantage of the strengths of both for increasing the predictability of words in documents. To combine them, we incorporate the context word $w_i$ into the PLSA parameters. In other words, we associate the generation of words and documents with the context word in addition to the latent topics.
The generative process of the bigram-PLSA model can be defined by the following scheme:

(1) Generate a context word $w_i$ as the word history with probability $P(w_i)$.

(2) Select a document $d_k$ with probability $P(d_k \mid w_i)$.

(3) Pick a latent variable $z_l$ with probability $P(z_l \mid w_i, d_k)$.

(4) Generate a word $w_j$ with probability $P(w_j \mid w_i, z_l)$.
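The four generative steps can be illustrated with a minimal sampling sketch. The array layout, the helper names `random_dist` and `sample_tuple`, and the toy sizes are assumptions made for illustration only; the distributions are random placeholders, not trained parameters.

```python
import numpy as np

def random_dist(rng, shape, axis):
    """Random strictly positive array normalized to sum to 1 along `axis`."""
    x = rng.random(shape) + 1e-3
    return x / x.sum(axis=axis, keepdims=True)

def sample_tuple(rng, p_w, p_d_w, p_z_wd, p_w_wz):
    """Draw (i, k, l, j) following the four generative steps.

    Assumed layout (M words, N documents, K topics):
      p_w[i]          = P(w_i)              shape (M,)
      p_d_w[i, k]     = P(d_k | w_i)        shape (M, N)
      p_z_wd[i, k, l] = P(z_l | w_i, d_k)   shape (M, N, K)
      p_w_wz[i, j, l] = P(w_j | w_i, z_l)   shape (M, M, K)
    """
    M, N, K = p_z_wd.shape
    i = rng.choice(M, p=p_w)              # (1) context word
    k = rng.choice(N, p=p_d_w[i])         # (2) document given the context word
    l = rng.choice(K, p=p_z_wd[i, k])     # (3) topic given context word and document
    j = rng.choice(M, p=p_w_wz[i, :, l])  # (4) word given context word and topic
    return i, k, l, j

rng = np.random.default_rng(0)
M, N, K = 5, 3, 2  # toy sizes, chosen only for illustration
p_w = random_dist(rng, (M,), axis=0)
p_d_w = random_dist(rng, (M, N), axis=1)
p_z_wd = random_dist(rng, (M, N, K), axis=2)
p_w_wz = random_dist(rng, (M, M, K), axis=1)

samples = [sample_tuple(rng, p_w, p_d_w, p_z_wd, p_w_wz) for _ in range(10)]
```

Each sampled tuple contains the latent topic index `l`, which is unobserved in real data; only the word pair and the document are seen during training.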
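Translating the generative process into a joint probability model results in
$$P(w_i, w_j, d_k) = P(w_i)\, P(d_k \mid w_i) \sum_{l=1}^{K} P(z_l \mid w_i, d_k)\, P(w_j \mid w_i, z_l). \tag{6}$$
According to (6), the occurrence probability of the word $w_j$ given the document $d_k$ and the word history $w_i$ is defined as
$$P(w_j \mid w_i, d_k) = \sum_{l=1}^{K} P(w_j \mid w_i, z_l)\, P(z_l \mid w_i, d_k). \tag{7}$$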
Equation (7) is an extended version of the aspect model that considers the word history in the word-document modeling and can be considered a combination of the bigram and PLSA models. In (7), the distributions $P(w_j \mid w_i, z_l)$ and $P(z_l \mid w_i, d_k)$ are the model parameters that should be estimated from training data. This model is similar to the original PLSA model except that the context word (word history) is incorporated in the model parameters.
Like the original aspect model, the extended aspect model assumes conditional independence between the word $w_j$ and the document $d_k$, that is, $w_j$ and $d_k$ are independent conditioned on the latent variable $z_l$ and the context word $w_i$:
$$P(w_j, d_k \mid w_i, z_l) = P(w_j \mid w_i, z_l)\, P(d_k \mid w_i, z_l). \tag{8}$$
The justification for the assumed conditional independence in the proposed model is the same as in the PLSA model: it simplifies the model formulation and reasonably reduces the computational cost.
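As in the original PLSA model, the equivalent parameterization of the joint probability in (6) can be written as
$$P(w_i, w_j, d_k) = P(w_i) \sum_{l=1}^{K} P(z_l \mid w_i)\, P(w_j \mid w_i, z_l)\, P(d_k \mid w_i, z_l). \tag{9}$$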
3.3. Parameter Estimation Using the EM Algorithm
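As in the original PLSA model, we reestimate the parameters of the bigram-PLSA model using the EM procedure. In the E-step, we simply apply Bayes' rule to obtain the posterior probability of the latent variable $z_l$ given the observed data $d_k$, $w_i$, and $w_j$.
E-step:
$$P(z_l \mid w_i, w_j, d_k) = \frac{P(w_i, w_j, d_k, z_l)}{\sum_{l'=1}^{K} P(w_i, w_j, d_k, z_{l'})}. \tag{10}$$
Since the factor $P(w_i)\, P(d_k \mid w_i)$ is common to the numerator and the denominator, we can rewrite (10) as
$$P(z_l \mid w_i, w_j, d_k) = \frac{P(w_j \mid w_i, z_l)\, P(z_l \mid w_i, d_k)}{\sum_{l'=1}^{K} P(w_j \mid w_i, z_{l'})\, P(z_{l'} \mid w_i, d_k)}. \tag{11}$$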
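In the M-step, the parameters are updated by maximizing the log-likelihood of the complete data (words and documents) with respect to the probabilistic model. The likelihood of the complete data with respect to the probabilistic model is computed as
$$L = \prod_{i} \prod_{j} \prod_{k} P(w_i, w_j, d_k)^{\,n(w_i, w_j, d_k)}, \tag{12}$$
where $P(w_i, w_j, d_k)$ is the occurrence probability of the word pair $w_i w_j$ in the document $d_k$ and $n(w_i, w_j, d_k)$ is the frequency of the word pair $w_i w_j$ in the document $d_k$.
Let $\theta = \{P(w_j \mid w_i, z_l),\, P(z_l \mid w_i, d_k)\}$ be the set of model parameters. For estimating $\theta$, we use MLE to maximize the log-likelihood of the complete data:
$$\hat{\theta} = \arg\max_{\theta} \sum_{i} \sum_{j} \sum_{k} n(w_i, w_j, d_k) \log P(w_i, w_j, d_k). \tag{13}$$
Considering (7), we expand the above equation to
$$\hat{\theta} = \arg\max_{\theta} \Big[ \sum_{i,j,k} n(w_i, w_j, d_k) \log \big( P(w_i)\, P(d_k \mid w_i) \big) + \sum_{i,j,k} n(w_i, w_j, d_k) \log \sum_{l=1}^{K} P(w_j \mid w_i, z_l)\, P(z_l \mid w_i, d_k) \Big]. \tag{14}$$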
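In (14), the term before the plus sign is omitted because it is independent of $\theta$. In order to maximize the log-likelihood, (14) should be differentiated. Differentiating (14) with respect to the parameters does not lead to well-formed formulae, so we find a lower bound for (14) using Jensen's inequality, with the posterior $P(z_l \mid w_i, w_j, d_k)$ taken from the E-step:
$$\sum_{i,j,k} n(w_i, w_j, d_k) \log \sum_{l=1}^{K} P(w_j \mid w_i, z_l)\, P(z_l \mid w_i, d_k) \;\geq\; \sum_{i,j,k} n(w_i, w_j, d_k) \sum_{l=1}^{K} P(z_l \mid w_i, w_j, d_k) \log \big[ P(w_j \mid w_i, z_l)\, P(z_l \mid w_i, d_k) \big]. \tag{15}$$
The obtained lower bound should be maximized, that is, we maximize the right-hand side of (15) instead of its left-hand side. Maximizing the lower bound and reestimating the parameters is a constrained optimization problem because all parameters are probability distributions. Therefore, the parameters should satisfy the constraints
$$\sum_{j} P(w_j \mid w_i, z_l) = 1 \;\; \forall i, l, \qquad \sum_{l=1}^{K} P(z_l \mid w_i, d_k) = 1 \;\; \forall i, k. \tag{16}$$
In order to account for the above constraints, the right-hand side of (15) has to be augmented by the appropriate Lagrange multipliers:
$$H = \sum_{i,j,k} n(w_i, w_j, d_k) \sum_{l=1}^{K} P(z_l \mid w_i, w_j, d_k) \log \big[ P(w_j \mid w_i, z_l)\, P(z_l \mid w_i, d_k) \big] + \sum_{i,l} \tau_{il} \Big( 1 - \sum_{j} P(w_j \mid w_i, z_l) \Big) + \sum_{i,k} \rho_{ik} \Big( 1 - \sum_{l=1}^{K} P(z_l \mid w_i, d_k) \Big), \tag{17}$$
where $\tau_{il}$ and $\rho_{ik}$ are the Lagrange multipliers related to the constraints specified in (16).
Differentiating the above equation partially with respect to the different parameters leads to
$$\frac{\partial H}{\partial P(w_j \mid w_i, z_l)} = \frac{\sum_{k} n(w_i, w_j, d_k)\, P(z_l \mid w_i, w_j, d_k)}{P(w_j \mid w_i, z_l)} - \tau_{il} = 0, \qquad \frac{\partial H}{\partial P(z_l \mid w_i, d_k)} = \frac{\sum_{j} n(w_i, w_j, d_k)\, P(z_l \mid w_i, w_j, d_k)}{P(z_l \mid w_i, d_k)} - \rho_{ik} = 0. \tag{18}$$
Solving (18) and applying the constraints (16), the M-step reestimation formulae are obtained:
$$P(w_j \mid w_i, z_l) = \frac{\sum_{k} n(w_i, w_j, d_k)\, P(z_l \mid w_i, w_j, d_k)}{\sum_{j'} \sum_{k} n(w_i, w_{j'}, d_k)\, P(z_l \mid w_i, w_{j'}, d_k)}, \qquad P(z_l \mid w_i, d_k) = \frac{\sum_{j} n(w_i, w_j, d_k)\, P(z_l \mid w_i, w_j, d_k)}{\sum_{j} n(w_i, w_j, d_k)}. \tag{19}$$
The E-step and M-step are repeated until a convergence criterion is met.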
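As a rough illustration of the procedure, the following sketch runs a dense EM iteration on randomly generated toy counts. It assumes the posterior is proportional to $P(w_j \mid w_i, z_l)\, P(z_l \mid w_i, d_k)$ and that the M-step normalizes the expected counts; the array layout, the function names `normalize` and `em_step`, the fallback for empty rows, and the toy sizes are all assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def normalize(x, axis, fallback):
    """Normalize x along `axis`; where the total is zero, keep `fallback`."""
    total = x.sum(axis=axis, keepdims=True)
    safe = np.where(total > 0, total, 1.0)
    return np.where(total > 0, x / safe, fallback)

def em_step(n, p_w_wz, p_z_wd):
    """One dense EM iteration (assumed layout, M words, N docs, K topics).

    n[i, j, k]      = count of word pair (w_i, w_j) in d_k   shape (M, M, N)
    p_w_wz[i, j, l] = P(w_j | w_i, z_l)                      shape (M, M, K)
    p_z_wd[i, k, l] = P(z_l | w_i, d_k)                      shape (M, N, K)
    Returns the updated parameters and the parameter-dependent part of the
    log-likelihood, evaluated before the update.
    """
    # E-step: joint[i, j, k, l] = P(w_j | w_i, z_l) * P(z_l | w_i, d_k)
    joint = p_w_wz[:, :, None, :] * p_z_wd[:, None, :, :]
    norm = joint.sum(axis=3, keepdims=True)        # P(w_j | w_i, d_k)
    post = joint / np.where(norm > 0, norm, 1.0)   # posterior over topics
    loglik = float(np.sum(n * np.log(np.where(n > 0, norm[..., 0], 1.0))))

    # M-step: normalize the expected counts n * posterior
    expected = n[..., None] * post                           # (M, M, N, K)
    new_w_wz = normalize(expected.sum(axis=2), 1, p_w_wz)    # sum over k, normalize over j
    new_z_wd = normalize(expected.sum(axis=1), 2, p_z_wd)    # sum over j, normalize over l
    return new_w_wz, new_z_wd, loglik

rng = np.random.default_rng(0)
M, N, K = 4, 3, 2  # toy sizes, chosen only for illustration
n = rng.integers(0, 4, size=(M, M, N)).astype(float)
p_w_wz = rng.random((M, M, K))
p_w_wz /= p_w_wz.sum(axis=1, keepdims=True)
p_z_wd = rng.random((M, N, K))
p_z_wd /= p_z_wd.sum(axis=2, keepdims=True)

history = []
for _ in range(20):
    p_w_wz, p_z_wd, ll = em_step(n, p_w_wz, p_z_wd)
    history.append(ll)
```

The values in `history` should be non-decreasing, which is a convenient sanity check for any EM implementation. This dense version materializes the full four-dimensional posterior and is therefore only practical for very small vocabularies.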
3.4. Implementation and Complexity Analysis
For implementing the EM algorithm, in the E-step, we need to calculate $P(z_l \mid w_i, w_j, d_k)$ for all i, j, k, and l. This requires four nested loops. Thus the time complexity of the E-step is $O(M^2 N K)$, where M, N, and K are the number of words, the number of documents, and the number of latent topics, respectively. The memory requirements in the E-step include a four-dimensional matrix for saving $P(z_l \mid w_i, w_j, d_k)$ and a three-dimensional matrix for saving the normalization parameter (the denominator of (11)). To reduce the memory requirements, note that it is not necessary to calculate and save $P(z_l \mid w_i, w_j, d_k)$ in the E-step; rather, it can be calculated in the M-step by multiplying the previous $P(w_j \mid w_i, z_l)$ and $P(z_l \mid w_i, d_k)$ and dividing the result by the normalization parameter. Therefore, we save only the normalization parameter in the E-step. According to (7), the normalization parameter is equal to $P(w_j \mid w_i, d_k)$; thus the related matrix contains $M^2 N$ elements, which is a large number for typical values of M and N.
In the M-step, we need to calculate the model parameters $P(w_j \mid w_i, z_l)$ and $P(z_l \mid w_i, d_k)$ specified in (19). These calculations require four nested loops, but we can reduce them to three nested loops by considering only the word pairs that are present in the training documents instead of all word pairs. Thus the time complexity of the M-step is $O(KNB)$, where B is the average number of word pairs in the training documents.
The memory requirements in the M-step include two three-dimensional matrices for saving $P(w_j \mid w_i, z_l)$ and $P(z_l \mid w_i, d_k)$ and two two-dimensional matrices for saving the denominators of (19). Saving these large matrices results in high memory requirements in the training process. The count matrix $n(w_i, w_j, d_k)$ can be implemented as a sparse matrix containing the indices of the word pairs present in each training document and the counts of those word pairs.
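The idea of visiting only the observed word pairs can be sketched as follows. Counts are stored as a list of observed (i, j, k) index triples, and the posterior is recomputed per triple rather than stored as a four-dimensional array. The function names `pair_counts`, `normalize`, and `em_step_sparse`, the toy corpus, and the choice to compute the normalization on the fly instead of keeping a full table are assumptions of this sketch.

```python
from collections import Counter
import numpy as np

def pair_counts(docs):
    """Observed (i, j, k) triples and their counts n(w_i, w_j, d_k)."""
    counter = Counter()
    for k, doc in enumerate(docs):
        for i, j in zip(doc[:-1], doc[1:]):
            counter[(i, j, k)] += 1
    triples = np.array(list(counter.keys()), dtype=int)
    counts = np.array(list(counter.values()), dtype=float)
    return triples, counts

def normalize(x, axis, fallback):
    """Normalize x along `axis`; where the total is zero, keep `fallback`."""
    total = x.sum(axis=axis, keepdims=True)
    safe = np.where(total > 0, total, 1.0)
    return np.where(total > 0, x / safe, fallback)

def em_step_sparse(triples, counts, p_w_wz, p_z_wd):
    """One EM iteration that touches only observed word pairs.

    p_w_wz[i, j, l] = P(w_j | w_i, z_l), p_z_wd[i, k, l] = P(z_l | w_i, d_k).
    Work is proportional to (number of observed triples) * K.
    """
    i, j, k = triples.T
    joint = p_w_wz[i, j] * p_z_wd[i, k]           # shape (B, K)
    norm = joint.sum(axis=1, keepdims=True)       # P(w_j | w_i, d_k) per triple
    expected = counts[:, None] * joint / norm     # count * posterior
    loglik = float(np.sum(counts * np.log(norm[:, 0])))

    acc_w = np.zeros_like(p_w_wz)
    acc_z = np.zeros_like(p_z_wd)
    np.add.at(acc_w, (i, j), expected)            # accumulate over documents
    np.add.at(acc_z, (i, k), expected)            # accumulate over next words
    new_w_wz = normalize(acc_w, 1, p_w_wz)
    new_z_wd = normalize(acc_z, 2, p_z_wd)
    return new_w_wz, new_z_wd, loglik

# Toy corpus of word-id sequences (hypothetical data).
docs = [[0, 1, 2, 1, 0, 1], [2, 2, 1, 0, 3], [3, 0, 1, 2, 3, 3]]
M, N, K = 4, len(docs), 2
triples, counts = pair_counts(docs)

rng = np.random.default_rng(1)
p_w_wz = rng.random((M, M, K))
p_w_wz /= p_w_wz.sum(axis=1, keepdims=True)
p_z_wd = rng.random((M, N, K))
p_z_wd /= p_z_wd.sum(axis=2, keepdims=True)

history = []
for _ in range(10):
    p_w_wz, p_z_wd, ll = em_step_sparse(triples, counts, p_w_wz, p_z_wd)
    history.append(ll)
```

The parameter arrays here are still dense, so the memory cost of the two three-dimensional parameter matrices remains; only the counts and the per-iteration intermediate quantities are sparse.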
3.5. Extension to n-gram
We can extend the bigram-PLSA model to an n-gram-PLSA model by considering $n-1$ context words instead of only one context word $w_i$ as the word history. The generative process of the n-gram-PLSA model is similar to that of the bigram-PLSA model except that in step 1, instead of generating one context word, $n-1$ context words should be generated. Therefore, the combined model can be expressed by
$$P(w_j \mid h_i, d_k) = \sum_{l=1}^{K} P(w_j \mid h_i, z_l)\, P(z_l \mid h_i, d_k), \tag{20}$$
where $h_i$ is a sequence of $n-1$ words. We can follow the same EM procedure for parameter estimation in the n-gram-PLSA model, where $w_i$ is replaced by $h_i$ in all formulae. In the reestimation formulae, $n(h_i, w_j, d_k)$ is the number of occurrences of the word sequence $h_i w_j$ in the document $d_k$.
Combining the PLSA model and the n-gram model for $n > 2$ leads to high time and memory complexity in the training process. As discussed in Section 3.4, the time complexity of the EM algorithm is $O(M^2 N K)$ for $n = 2$. Consequently, the time complexity for higher-order n-grams is $O(M^n N K)$, which grows exponentially as n increases. In addition, the memory requirement for the n-gram-PLSA combination is very high. For example, for saving the normalization parameters, we need an $(n+1)$-dimensional matrix that contains $M^n N$ elements. Therefore, the memory requirement also grows exponentially as n increases.
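To make the growth concrete, the small sketch below evaluates the E-step operation count and the size of the normalization table for a few n-gram orders, up to constant factors. It assumes that the number of possible histories is the vocabulary size raised to the power n-1; the function name `ngram_plsa_cost` and the corpus sizes are hypothetical and chosen only for illustration.

```python
def ngram_plsa_cost(M, N, K, n):
    """Return (E-step operations, normalization-table entries), up to constants.

    M: vocabulary size, N: number of documents, K: number of topics,
    n: n-gram order (n = 2 is the bigram case).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    histories = M ** (n - 1)            # possible (n-1)-word histories
    time_ops = histories * M * N * K    # one posterior per (history, word, doc, topic)
    norm_entries = histories * M * N    # one normalizer per (history, word, doc)
    return time_ops, norm_entries

# Hypothetical sizes, for illustration only.
costs = {n: ngram_plsa_cost(M=10_000, N=1_000, K=50, n=n) for n in (2, 3, 4)}
```

Each increment of n multiplies both quantities by the vocabulary size, which is why a dense implementation quickly becomes infeasible beyond the bigram case.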
3.6. Comparison with Nie et al.'s Bigram-PLSA Model
As discussed in Section 3.1, Nie et al. presented a combination of the bigram and PLSA models in 2007 [13]. That work does not have a strong mathematical foundation, and its reestimation formulae cannot be derived from it via the standard EM procedure. Nie et al.'s work is based on an assumption of independence between the latent topics $z_l$ and the context words $w_i$. Under this assumption, we can rewrite (7) as
$$P(w_j \mid w_i, d_k) = \sum_{l=1}^{K} P(w_j \mid w_i, z_l)\, P(z_l \mid d_k). \tag{21}$$
According to (21), the difference between our model and Nie et al.'s model is in the definition of the topic probability. In Nie et al.'s model, the topic probability is conditioned only on the document, whereas in our model it is further conditioned on the bigram history. In Nie et al.'s model, the assumption of independence between the latent topics and the context words leads to assigning the latent topics to each context word evenly, that is, the same number of latent variables is assigned to decompose the word-document matrices of all context words despite their different complexities. Thus, they propose a refining procedure that unevenly assigns the latent topics to the context words according to an estimate of their latent semantic complexities.
In our proposed bigram-PLSA model, we relax the assumption of independence between the latent topics and the context words and obtain a general form of the aspect model that considers the word history in the word-document modeling. Our model automatically assigns the latent topics to the context words unevenly because, for each context $h_i$ (here, the context word $w_i$), there is a distribution $P(z_l \mid w_i, d_k)$ that assigns the appropriate number of latent topics to that context. Consequently, $P(z_l \mid w_i, d_k)$ remains zero for those $z_l$ inappropriate to the context word $w_i$.
The number of free parameters in our proposed model is $M(M-1)K + MN(K-1)$, where M, N, and K are the number of words, the number of documents, and the number of latent topics, respectively. On the other hand, the number of free parameters in Nie et al.'s model is $M(M-1)K + N(K-1)$, which is less than the number of free parameters in our model. Consequently, the training time of Nie et al.'s model is less than the training time of our model.
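The comparison can be made concrete with a short counting sketch. It assumes the usual convention that a probability distribution over S outcomes contributes S-1 free values, and counts the word distributions given context word and topic together with the topic distributions of each model. The function names and the example sizes are hypothetical.

```python
def free_params_proposed(M, N, K):
    """Free parameters when topics are conditioned on context word and document."""
    word_part = M * K * (M - 1)     # one word distribution per (context word, topic)
    topic_part = M * N * (K - 1)    # one topic distribution per (context word, document)
    return word_part + topic_part

def free_params_nie(M, N, K):
    """Free parameters when topics are conditioned on the document only."""
    word_part = M * K * (M - 1)     # same word distributions as above
    topic_part = N * (K - 1)        # one topic distribution per document
    return word_part + topic_part

# Hypothetical sizes, for illustration only.
M, N, K = 10, 5, 2
proposed = free_params_proposed(M, N, K)
nie = free_params_nie(M, N, K)
extra = proposed - nie              # equals (M - 1) * N * (K - 1)
```

Under this counting, the two models share the same word-distribution term and differ only in the topic-distribution term, which is larger by a factor of M in the proposed model.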