Before describing the proposed model, the previous research on combining the bigram and PLSA models by Nie et al. [13] is reviewed. That method is a special case (under certain independence assumptions) of our proposed method.
3.1. Nie et al.'s Bigram-PLSA Model
Nie et al. presented a combination of the bigram and PLSA models [13]. Instead of $P(w_j \mid z_l)$ in (1), their bigram-PLSA model employs $P(w_j \mid w_i, z_l)$, resulting in

$$P(w_j \mid w_i, d_k) = \sum_{l=1}^{K} P(w_j \mid w_i, z_l)\, P(z_l \mid d_k).$$
The EM procedure for training the combined model contains the following two steps.

E-step:

$$P(z_l \mid d_k, w_i, w_j) = \frac{P(w_j \mid w_i, z_l)\, P(z_l \mid d_k)}{\sum_{l'=1}^{K} P(w_j \mid w_i, z_{l'})\, P(z_{l'} \mid d_k)}.$$

M-step:

$$P(w_j \mid w_i, z_l) = \frac{\sum_{k} n(d_k, w_i, w_j)\, P(z_l \mid d_k, w_i, w_j)}{\sum_{j'} \sum_{k} n(d_k, w_i, w_{j'})\, P(z_l \mid d_k, w_i, w_{j'})},$$

$$P(z_l \mid d_k) = \frac{\sum_{i} \sum_{j} n(d_k, w_i, w_j)\, P(z_l \mid d_k, w_i, w_j)}{n(d_k)},$$

where $n(d_k, w_i, w_j)$ is the number of times that the word pair $(w_i, w_j)$ occurs in the document $d_k$, and $n(d_k)$ is the number of words in the document $d_k$.
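For concreteness, the mixture defining Nie et al.'s model can be evaluated with a single tensor contraction over the latent topics. The short numpy sketch below is ours, not from [13]; the array names, layouts, and dimensions are illustrative assumptions.

```python
import numpy as np

# Illustrative dimensions (assumed, not taken from the paper)
M, N, K = 50, 10, 4          # vocabulary size, number of documents, number of topics
rng = np.random.default_rng(0)

# Nie et al.'s parameters: P(w_j | w_i, z_l), indexed [i, l, j], and P(z_l | d_k), indexed [k, l]
p_w_given_wz = rng.dirichlet(np.ones(M), size=(M, K))   # shape (M, K, M)
p_z_given_d = rng.dirichlet(np.ones(K), size=N)         # shape (N, K)

# P(w_j | w_i, d_k) = sum_l P(w_j | w_i, z_l) * P(z_l | d_k)
p_w_given_wd = np.einsum('ilj,kl->ikj', p_w_given_wz, p_z_given_d)  # shape (M, N, M)

# Each next-word distribution sums to one
assert np.allclose(p_w_given_wd.sum(axis=-1), 1.0)
```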
3.2. Proposed Bigram-PLSA Model
We intend to combine the bigram and PLSA models to take advantage of the strengths of both models and increase the predictability of words in documents. To combine the bigram and PLSA models, we incorporate the context word $w_i$ into the PLSA parameters. In other words, we condition the generation of words and documents on the context word in addition to the latent topics.
The generative process of the bigram-PLSA model can be defined by the following scheme (a sampling sketch is given after the list):

(1) Generate a context word $w_i$ as the word history with probability $P(w_i)$.

(2) Select a document $d_k$ with probability $P(d_k \mid w_i)$.

(3) Pick a latent variable $z_l$ with probability $P(z_l \mid d_k, w_i)$.

(4) Generate a word $w_j$ with probability $P(w_j \mid z_l, w_i)$.
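The following minimal sketch samples one $(w_i, d_k, z_l, w_j)$ tuple according to the four steps above. It only illustrates the order of conditioning; the distributions are randomly initialized and the dimensions are made-up assumptions, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 50, 10, 4   # assumed vocabulary size, document count, and topic count

# Randomly initialized distributions matching the four steps (illustrative only)
p_w = rng.dirichlet(np.ones(M))                        # P(w_i)
p_d_given_w = rng.dirichlet(np.ones(N), size=M)        # P(d_k | w_i),      shape (M, N)
p_z_given_dw = rng.dirichlet(np.ones(K), size=(M, N))  # P(z_l | d_k, w_i), shape (M, N, K)
p_w_given_zw = rng.dirichlet(np.ones(M), size=(M, K))  # P(w_j | z_l, w_i), shape (M, K, M)

i = rng.choice(M, p=p_w)                  # step (1): context word w_i
k = rng.choice(N, p=p_d_given_w[i])       # step (2): document d_k
l = rng.choice(K, p=p_z_given_dw[i, k])   # step (3): latent topic z_l
j = rng.choice(M, p=p_w_given_zw[i, l])   # step (4): word w_j
print("sampled (w_i, d_k, z_l, w_j) indices:", i, k, l, j)
```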
Translating the generative process into a joint probability model results in

$$P(w_i, d_k, w_j) = P(w_i)\, P(d_k \mid w_i) \sum_{l=1}^{K} P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i). \qquad (6)$$

According to (6), the occurrence probability of the word $w_j$ given the document $d_k$ and the word history $w_i$ is defined as

$$P(w_j \mid d_k, w_i) = \sum_{l=1}^{K} P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i). \qquad (7)$$
Equation (7) is an extended version of the aspect model that considers the word history in the word-document modeling and can be regarded as a combination of the bigram and PLSA models. In (7), the distributions $P(w_j \mid z_l, w_i)$ and $P(z_l \mid d_k, w_i)$ are the model parameters that should be estimated from the training data. This model is similar to the original PLSA model except that the context word (word history) $w_i$ is incorporated in the model parameters.
Like the original aspect model, the extended aspect model assumes conditional independence between the word $w_j$ and the document $d_k$; that is, $w_j$ and $d_k$ are independent conditioned on the latent variable $z_l$ and the context word $w_i$:

$$P(w_j, d_k \mid z_l, w_i) = P(w_j \mid z_l, w_i)\, P(d_k \mid z_l, w_i). \qquad (8)$$
The justification for the assumed conditional independence in the proposed model is the same reasoning used in the PLSA model to obtain an analytically tractable model, that is, simplification of the model formulation and a reasonable reduction of the computational cost.
As in the original PLSA model, the equivalent parameterization of the joint probability in (6) can be written as

$$P(w_i, d_k, w_j) = P(w_i) \sum_{l=1}^{K} P(z_l \mid w_i)\, P(w_j \mid z_l, w_i)\, P(d_k \mid z_l, w_i). \qquad (9)$$
3.3. Parameter Estimation Using the EM Algorithm
Like the original PLSA model, we re-estimate the parameters of the bigram-PLSA model using the EM procedure. In the E-step, we simply apply Bayes' rule to obtain the posterior probability of the latent variable $z_l$ given the observed data $d_k$, $w_i$, and $w_j$.

E-step:

$$P(z_l \mid d_k, w_i, w_j) = \frac{P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i)}{\sum_{l'=1}^{K} P(w_j \mid z_{l'}, w_i)\, P(z_{l'} \mid d_k, w_i)}. \qquad (10)$$
We can rewrite (10) as

$$P(z_l \mid d_k, w_i, w_j) = \frac{P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i)}{P(w_j \mid d_k, w_i)}. \qquad (11)$$
In the M-step, the parameters are updated by maximizing the log-likelihood of the complete data (words and documents) with respect to the probabilistic model. The likelihood of the complete data with respect to the probabilistic model is computed as

$$L = \prod_{k} \prod_{i} \prod_{j} P(d_k, w_i, w_j)^{\,n(d_k, w_i, w_j)}, \qquad (12)$$

where $P(d_k, w_i, w_j)$ is the occurrence probability of the word pair $(w_i, w_j)$ in the document $d_k$ and $n(d_k, w_i, w_j)$ is the frequency of the word pair $(w_i, w_j)$ in the document $d_k$.
Let $\theta$ be the set of model parameters. For estimating $\theta$, we use maximum-likelihood estimation to maximize the log-likelihood of the complete data:

$$\log L(\theta) = \sum_{k} \sum_{i} \sum_{j} n(d_k, w_i, w_j) \log P(d_k, w_i, w_j). \qquad (13)$$

Considering (7), we expand the above equation to

$$\log L(\theta) = \sum_{k, i, j} n(d_k, w_i, w_j) \log P(d_k, w_i) + \sum_{k, i, j} n(d_k, w_i, w_j) \log \sum_{l=1}^{K} P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i). \qquad (14)$$
In (14), the term before the plus sign is omitted because it is independent of $\theta$. In order to maximize the log-likelihood, (14) should be differentiated; however, differentiating (14) directly with respect to the parameters does not lead to well-formed formulae, so we find a lower bound for (14) using Jensen's inequality:

$$\sum_{k, i, j} n(d_k, w_i, w_j) \log \sum_{l=1}^{K} P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i) \;\geq\; \sum_{k, i, j} n(d_k, w_i, w_j) \sum_{l=1}^{K} P(z_l \mid d_k, w_i, w_j) \log \frac{P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i)}{P(z_l \mid d_k, w_i, w_j)}. \qquad (15)$$
The obtained lower bound should then be maximized; that is, we maximize the right-hand side of (15) instead of its left-hand side. Maximizing the lower bound and re-estimating the parameters is a constrained optimization problem because all parameters are probability distributions. Therefore, the parameters should satisfy the constraints

$$\sum_{j} P(w_j \mid z_l, w_i) = 1 \quad \forall\, l, i, \qquad \sum_{l=1}^{K} P(z_l \mid d_k, w_i) = 1 \quad \forall\, k, i. \qquad (16)$$
In order to take the above constraints into account, the right-hand side of (15) has to be augmented by the appropriate Lagrange multipliers:

$$H = \sum_{k, i, j} n(d_k, w_i, w_j) \sum_{l=1}^{K} P(z_l \mid d_k, w_i, w_j) \log \bigl[ P(w_j \mid z_l, w_i)\, P(z_l \mid d_k, w_i) \bigr] + \sum_{l, i} \tau_{l i} \Bigl( 1 - \sum_{j} P(w_j \mid z_l, w_i) \Bigr) + \sum_{k, i} \rho_{k i} \Bigl( 1 - \sum_{l=1}^{K} P(z_l \mid d_k, w_i) \Bigr), \qquad (17)$$

where $\tau_{l i}$ and $\rho_{k i}$ are the Lagrange multipliers related to the constraints specified in (16).
Differentiating the above equation partially with respect to the different parameters leads to

$$\frac{\partial H}{\partial P(w_j \mid z_l, w_i)} = \frac{\sum_{k} n(d_k, w_i, w_j)\, P(z_l \mid d_k, w_i, w_j)}{P(w_j \mid z_l, w_i)} - \tau_{l i} = 0, \qquad \frac{\partial H}{\partial P(z_l \mid d_k, w_i)} = \frac{\sum_{j} n(d_k, w_i, w_j)\, P(z_l \mid d_k, w_i, w_j)}{P(z_l \mid d_k, w_i)} - \rho_{k i} = 0. \qquad (18)$$
Solving (18) and applying the constraints (16), the M-step re-estimation formulae, (19), are obtained:

M-step:

$$P(w_j \mid z_l, w_i) = \frac{\sum_{k} n(d_k, w_i, w_j)\, P(z_l \mid d_k, w_i, w_j)}{\sum_{j'} \sum_{k} n(d_k, w_i, w_{j'})\, P(z_l \mid d_k, w_i, w_{j'})}, \qquad P(z_l \mid d_k, w_i) = \frac{\sum_{j} n(d_k, w_i, w_j)\, P(z_l \mid d_k, w_i, w_j)}{\sum_{l'=1}^{K} \sum_{j} n(d_k, w_i, w_j)\, P(z_{l'} \mid d_k, w_i, w_j)}. \qquad (19)$$
The E-step and the M-step are repeated until a convergence criterion is met.
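The sketch below implements one dense E-step/M-step pass following (10), (11), and (19). It is a minimal illustration under assumed array layouts and random toy data; a practical implementation would exploit the sparse pair counts and the memory-saving trick discussed in Section 3.4.

```python
import numpy as np

def em_iteration(n, p_wzw, p_zdw, eps=1e-12):
    """One EM pass for the bigram-PLSA sketch.

    n     : counts n(d_k, w_i, w_j), shape (N, M, M), indexed [k, i, j]
    p_wzw : P(w_j | z_l, w_i),       shape (M, K, M), indexed [i, l, j]
    p_zdw : P(z_l | d_k, w_i),       shape (N, M, K), indexed [k, i, l]
    """
    # E-step, (10)/(11): posterior of z_l given (d_k, w_i, w_j), shape (N, M, M, K)
    joint = np.einsum('ilj,kil->kijl', p_wzw, p_zdw)
    post = joint / (joint.sum(axis=-1, keepdims=True) + eps)

    # M-step, (19): counts weighted by the posterior
    c = n[..., None] * post                        # shape (N, M, M, K)

    num_wzw = c.sum(axis=0)                        # sum over k -> [i, j, l]
    p_wzw_new = num_wzw / (num_wzw.sum(axis=1, keepdims=True) + eps)  # normalize over j
    p_wzw_new = np.transpose(p_wzw_new, (0, 2, 1))                    # back to [i, l, j]

    num_zdw = c.sum(axis=2)                        # sum over j -> [k, i, l]
    p_zdw_new = num_zdw / (num_zdw.sum(axis=-1, keepdims=True) + eps) # normalize over l
    return p_wzw_new, p_zdw_new

# Tiny smoke test with assumed dimensions and random data
rng = np.random.default_rng(2)
M, N, K = 20, 5, 3
n = rng.poisson(0.2, size=(N, M, M)).astype(float)
p_wzw = rng.dirichlet(np.ones(M), size=(M, K))   # shape (M, K, M)
p_zdw = rng.dirichlet(np.ones(K), size=(N, M))   # shape (N, M, K)
p_wzw, p_zdw = em_iteration(n, p_wzw, p_zdw)
```

Materializing the full posterior tensor is done here only for readability; as noted in Section 3.4, it can be avoided by storing only the normalization term $P(w_j \mid d_k, w_i)$.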
3.4. Implementation and Complexity Analysis
For implementing the EM algorithm, in the E-step we need to calculate $P(z_l \mid d_k, w_i, w_j)$ for all $i$, $j$, $k$, and $l$. This requires four nested loops, so the time complexity of the E-step is $O(M^2 N K)$, where $M$, $N$, and $K$ are the number of words, the number of documents, and the number of latent topics, respectively. The memory requirements in the E-step include a four-dimensional matrix for saving $P(z_l \mid d_k, w_i, w_j)$ and a three-dimensional matrix for saving the normalization parameter (the denominator of (11)). To reduce the memory requirements, note that it is not necessary to calculate and save $P(z_l \mid d_k, w_i, w_j)$ in the E-step; rather, it can be calculated in the M-step by multiplying the previous $P(w_j \mid z_l, w_i)$ and $P(z_l \mid d_k, w_i)$ and dividing the result by the normalization parameter. Therefore, we save only the normalization parameter in the E-step. According to (7), the normalization parameter is equal to $P(w_j \mid d_k, w_i)$; thus the related matrix contains $M^2 N$ elements, which is a large number for typical values of $M$ and $N$.
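As a rough, illustrative calculation (the vocabulary and corpus sizes below are hypothetical, not those used in the experiments), a dense normalization matrix quickly becomes infeasible:

```python
# Hypothetical sizes, for illustration only
M, N = 20_000, 2_000                 # vocabulary words, documents
elements = M * M * N                 # M^2 * N entries of P(w_j | d_k, w_i)
print(f"{elements:.1e} elements ≈ {elements * 8 / 2**40:.1f} TiB as 8-byte floats")
# -> 8.0e+11 elements ≈ 5.8 TiB, far too large to store densely
```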
In the M-step, we need to calculate the model parameters $P(w_j \mid z_l, w_i)$ and $P(z_l \mid d_k, w_i)$ specified in (19). These calculations require four nested loops, but we can reduce them to three nested loops by considering only the word pairs that are present in the training documents instead of all word pairs. Thus the time complexity of the M-step is $O(K N B)$, where $B$ is the average number of word pairs in the training documents.
The memory requirements in the M-step include two three-dimensional matrices for saving $P(w_j \mid z_l, w_i)$ and $P(z_l \mid d_k, w_i)$ and two two-dimensional matrices for saving the denominators of (19). Saving these large matrices results in high memory requirements in the training process. $n(d_k, w_i, w_j)$ is another matrix; it can be implemented as a sparse matrix containing the indices of the word pairs present in each training document and the counts of those word pairs.
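One possible sparse layout for the pair counts $n(d_k, w_i, w_j)$ is a per-document dictionary keyed by word-pair indices. The helper below is a sketch under that assumption, not the authors' implementation:

```python
from collections import Counter

def pair_counts(documents):
    """documents: list of token-id lists. Returns one Counter per document,
    mapping (w_i, w_j) index pairs to their frequency n(d_k, w_i, w_j)."""
    return [Counter(zip(doc[:-1], doc[1:])) for doc in documents]

# Toy documents of word ids (hypothetical)
docs = [[0, 1, 2, 1, 2], [2, 2, 3]]
print(pair_counts(docs))
# [Counter({(1, 2): 2, (0, 1): 1, (2, 1): 1}), Counter({(2, 2): 1, (2, 3): 1})]
```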
3.5. Extension to $n$-gram
We can extend the bigram-PLSA model to an $n$-gram-PLSA model by considering the $n-1$ context words $w_{i-n+1}, \ldots, w_{i-1}$ instead of only one context word $w_i$ as the word history. The generative process of the $n$-gram-PLSA model is similar to that of the bigram-PLSA model except that in step 1, instead of generating one context word, $n-1$ context words should be generated. Therefore, the combined model can be expressed by

$$P(w_j \mid d_k, h_i) = \sum_{l=1}^{K} P(w_j \mid z_l, h_i)\, P(z_l \mid d_k, h_i), \qquad (20)$$

where $h_i = w_{i-n+1} \ldots w_{i-1}$ is a sequence of $n-1$ words. We can follow the same EM procedure for parameter estimation in the $n$-gram-PLSA model, where $w_i$ is replaced by $h_i$ in all formulae. In the re-estimation formulae, $n(d_k, h_i, w_j)$ is the number of occurrences of the word sequence $h_i w_j$ in the document $d_k$.
Combining the PLSA model with an $n$-gram model for $n > 2$ leads to high time and memory complexity in the training process. As discussed in Section 3.4, the time complexity of the EM algorithm is $O(M^2 N K)$ for $n = 2$. Consequently, the time complexity for higher-order $n$-grams is $O(M^n N K)$, which grows exponentially as $n$ increases. In addition, the memory requirement of the $n$-gram-PLSA combination is very high. For example, for saving the normalization parameters, we need an $(n+1)$-dimensional matrix which contains $M^n N$ elements. Therefore, the memory requirement also grows exponentially as $n$ increases.
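To make this growth concrete, the snippet below evaluates these expressions for a few values of $n$ using hypothetical dimensions (illustrative only):

```python
# Hypothetical dimensions, for illustration only
M, N, K = 20_000, 2_000, 100

for n in (2, 3, 4):
    ops = M**n * N * K        # O(M^n N K) E-step operations
    norm = M**n * N           # entries in the (n+1)-dimensional normalization matrix
    print(f"n={n}: ~{ops:.1e} operations, ~{norm:.1e} normalization entries")
```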
3.6. Comparison with Nie et al.'s Bigram-PLSA Model
As discussed in Section 3.1, Nie et al. presented a combination of the bigram and PLSA models in 2007 [13]. Their formulation does not have a strong mathematical foundation, and the re-estimation formulae cannot be derived from it via the standard EM procedure. Nie et al.'s work is based on an assumption of independence between the latent topics $z_l$ and the context words $w_i$. Under this assumption, we can rewrite (7) as

$$P(w_j \mid d_k, w_i) = \sum_{l=1}^{K} P(w_j \mid w_i, z_l)\, P(z_l \mid d_k). \qquad (21)$$
According to (21), the difference between our model and Nie et al.'s model lies in the definition of the topic probability. In Nie et al.'s model the topic probability is conditioned only on the documents, whereas in our model it is further conditioned on the bigram history. In Nie et al.'s model, the assumption of independence between the latent topics and the context words leads to assigning the latent topics to the context words evenly, that is, the same number of latent variables is assigned to decompose the word-document matrices of all context words despite their different complexities. Thus, they propose a refining procedure that unevenly assigns the latent topics to the context words according to an estimate of their latent semantic complexities.
In our proposed bigram-PLSA model, we relax the assumption of independence between the latent topics and the context words and obtain a general form of the aspect model that considers the word history in the word-document modeling. Our model automatically assigns the latent topics to the context words unevenly because, for each context word $w_i$, there is a distribution $P(z_l \mid d_k, w_i)$ that assigns the appropriate number of latent topics to that context. Consequently, $P(z_l \mid d_k, w_i)$ remains zero for those $z_l$ that are inappropriate for the context word $w_i$.
The number of free parameters in our proposed model is $K M^2 + K N M = K M (M + N)$, where $M$, $N$, and $K$ are the number of words, the number of documents, and the number of latent topics, respectively. On the other hand, the number of free parameters in Nie et al.'s model is $K M^2 + K N$, which is less than the number of free parameters in our model. Consequently, the training time of Nie et al.'s model is less than the training time of our model.
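A quick comparison of the two parameter counts, again with hypothetical dimensions (illustrative only, not the experimental setup):

```python
# Hypothetical dimensions, for illustration only
M, N, K = 20_000, 2_000, 100

proposed = K * M**2 + K * N * M   # P(w_j | z_l, w_i) plus P(z_l | d_k, w_i)
nie      = K * M**2 + K * N       # P(w_j | w_i, z_l) plus P(z_l | d_k)
print(f"proposed: {proposed:.2e}   Nie et al.: {nie:.2e}   ratio: {proposed / nie:.2f}")
```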