5 LDA as a Generative Model

This chapter focuses on LDA as a generative model. I’m going to build on the unigram generation example from the last chapter, adding a new variable with each example until we work our way up to the full LDA model. After getting a grasp of LDA as a generative model in this chapter, the following chapter will work backwards to answer the question: “If I have a bunch of documents, how do I infer topic information (word distributions, topic mixtures) from them?”

5.1 General Terminology

Let’s get the ugly part out of the way: the parameters and variables that are going to be used in the model.

  • alpha (\(\overrightarrow{\alpha}\)) : In order to determine the value of \(\theta\), the topic distribution of the document, we sample from a Dirichlet distribution using \(\overrightarrow{\alpha}\) as the input parameter. What does this mean? The \(\overrightarrow{\alpha}\) values are our prior information about the topic mixtures for that document. Example: I am creating a document generator to mimic other documents that have topics labeled for each word in the doc. I can use the total number of words from each topic across all documents as the \(\overrightarrow{\alpha}\) values.

  • beta (\(\overrightarrow{\beta}\)) : In order to determine the value of \(\phi\), the word distribution of a given topic, we sample from a Dirichlet distribution using \(\overrightarrow{\beta}\) as the input parameter. What does this mean? The \(\overrightarrow{\beta}\) values are our prior information about the word distribution in a topic. Example: I am creating a document generator to mimic other documents that have topics labeled for each word in the doc. I can use the number of times each word was used for a given topic as the \(\overrightarrow{\beta}\) values.

  • theta (\(\theta\)) : The topic proportions of a given document. More importantly, it will be used as the parameter for the multinomial distribution that identifies the topic of the next word. To clarify, the selected topic’s word distribution will then be used to select a word w.

  • phi (\(\phi\)) : The word distribution of each topic, i.e. the probability of each word in the vocabulary being generated if a given topic z (z ranges from 1 to k) is selected.

  • xi (\(\xi\)) : In the case of a variable-length document, the document length is determined by sampling from a Poisson distribution with an average length of \(\xi\).

  • k : Topic index
  • z : Topic selected for the next word to be generated.
  • w : Generated Word
  • d : Current Document

Outside of the variables above, all the distributions should be familiar from the previous chapter.
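
To make the roles of \(\overrightarrow{\alpha}\) and \(\overrightarrow{\beta}\) concrete, here is a minimal sketch of the two Dirichlet draws they parameterize. The rdirichlet() call is an assumption on my part; it is available in the gtools and MCMCpack packages and is the same function used in the code later in this chapter.

library(gtools) # rdirichlet(); MCMCpack provides an equivalent

alpha <- c(1, 1)    # prior over 2 topics for a single document
beta  <- c(1, 1, 1) # prior over a 3-word vocabulary

theta <- rdirichlet(1, alpha) # topic mixture for one document; sums to 1
phi   <- rdirichlet(2, beta)  # one word distribution per topic; each row sums to 1

theta
phi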

5.1.1 Selecting Parameters

The intent of this section is not to delve into different methods of parameter estimation for \(\alpha\) and \(\beta\), but to give a general understanding of how those values affect your model. For ease of understanding I will also stick with an assumption of symmetry, i.e. all values in \(\overrightarrow{\alpha}\) are equal to one another and all values in \(\overrightarrow{\beta}\) are equal to one another. Symmetry can be thought of as giving each topic equal prior probability in each document (for \(\alpha\)) and each word equal prior probability in each topic (for \(\beta\)).

In previous sections we have outlined how the \(\alpha\) parameters affect a Dirichlet distribution, but now it is time to connect the dots to how this affects our documents.
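
As a rough illustration of that connection (again assuming rdirichlet() from the gtools or MCMCpack package), we can draw many \(\theta\) vectors under different symmetric \(\alpha\) values and look at how large the dominant topic’s share is: small values push each document toward a single dominant topic, values near 1 spread documents evenly over the simplex, and large values push every document toward an even mix of topics.

library(gtools)
set.seed(42)

# share of the dominant topic across 1000 simulated 2-topic documents
summary(apply(rdirichlet(1000, rep(0.1, 2)), 1, max)) # mostly near 1: one topic dominates
summary(apply(rdirichlet(1000, rep(1,   2)), 1, max)) # spread out between 0.5 and 1
summary(apply(rdirichlet(1000, rep(10,  2)), 1, max)) # mostly near 0.5: even mixtures

The same intuition carries over to \(\beta\) and the word distributions \(\phi\): small symmetric \(\beta\) values concentrate each topic on a few words, while large values spread each topic over the whole vocabulary.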

5.2 Generative Model

LDA is known as a generative model. What is a generative model? Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space (Bishop 2006). This means we can create documents with a mixture of topics and a mixture of words based on those topics. Let’s start off with a simple example of generating unigrams.
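
As a quick sketch of that idea before topics are added back in (the word distribution below is a placeholder, not a value carried over from the previous chapter), a unigram generator simply draws every word from one fixed word distribution:

vocab   <- c('\U1F4D8', '\U1F4D5', '\U1F4D7') # 📘, 📕, 📗
p_word  <- c(0.2, 0.3, 0.5)                   # a single fixed word distribution
n_words <- 10

# rmultinom() with size = 1 returns a one-hot column per draw; which.max recovers the index
doc <- vocab[apply(rmultinom(n_words, 1, p_word), 2, which.max)]
paste(doc, collapse = ' ')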

5.2.1 Generating Documents

5.2.1.1 Topic Word Mixtures, Document Topic Mixtures, and Document Length Static

Building on the document generating model in chapter two, let’s try to create documents that have words drawn from more than one topic. To clarify, the constraints of the model will be:

  • set number of topics (2)
  • constant topic distributions in each document
  • constant word distribution in each topic

Known values:

  • 2 topics : word distributions of each topic below
    • \(\phi_{1}\) = [ 📕 = 0.0, 📘 = 0.1, 📗 = 0.9 ]
    • \(\phi_{2}\) = [ 📕 = 0.4, 📘 = 0.4, 📗 = 0.2 ]
  • All Documents have same topic distribution:
    • \(\theta = [ topic \hspace{2mm} a = 0.5,\hspace{2mm} topic \hspace{2mm} b = 0.5 ]\)
  • All Documents contain 10 words

Generative Model Pseudocode

  • For d = 1 to D where D is the number of documents
    • For w = 1 to W where W is the number of words in document d
      • Select the topic for word w
      • \(z_{i}\) ~ Multinomial(\(\theta_{d}\))
      • Select word based on topic z’s word distribution
      • \(w_{i}\) ~ Multinomial(\(\phi^{(z_{i})}\))
library(tidyverse) # tibble(), %>%, group_by(), summarise()
library(knitr)     # kable()

k <- 2 # number of topics
M <- 10 # let's create 10 documents
vocab <- c('\U1F4D8', '\U1F4D5', '\U1F4D7') # 📘, 📕, 📗
alphas <- rep(1,k) # topic document dirichlet parameters (not used until theta is sampled in later examples)

phi <- matrix(c(0.1, 0, 0.9,
                0.4, 0.4, 0.2), 
              nrow = k, 
              ncol = length(vocab), 
              byrow = TRUE)

theta <- c(0.5, 0.5)

N <- 10 #words in each document
ds <-tibble(doc_id = rep(0,N*M), 
            word = rep('', N*M),
            topic = rep(0, N*M)
            ) 
            
row_index <- 1
for(m in 1:M){
  for(n in 1:N){
    # sample topic index , i.e. select topic
    topic <- which(rmultinom(1,1,theta)==1)
    # sample word from topic
    new_word <- vocab[which(rmultinom(1,1,phi[topic, ])==1)]
    ds[row_index,] <- c(m,new_word, topic)
    row_index <- row_index + 1
  }
}

ds %>% group_by(doc_id) %>% summarise(
  tokens = paste(word, collapse = ' ')
) %>% kable()
doc_id tokens
1 📘 📕 📗 📘 📘 📘 📗 📘 📗 📗
10 📗 📘 📕 📘 📘 📗 📕 📕 📗 📗
2 📗 📗 📘 📗 📗 📗 📗 📗 📘 📗
3 📕 📗 📗 📗 📗 📕 📗 📕 📕 📗
4 📗 📘 📗 📗 📗 📗 📕 📗 📘 📗
5 📕 📗 📗 📘 📗 📘 📘 📗 📗 📗
6 📗 📘 📗 📘 📗 📕 📕 📗 📗 📗
7 📘 📗 📗 📗 📗 📗 📕 📕 📗 📕
8 📗 📗 📗 📗 📗 📗 📘 📕 📗 📕
9 📗 📘 📗 📗 📗 📗 📘 📘 📘 📕
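
As a quick sanity check (not part of the generative process itself): with \(\theta\) fixed at 50/50, the corpus-wide word frequencies should land near the average of the two topic word distributions, \(\theta^{\top}\phi\). Something along these lines, reusing the objects defined above:

ds %>% count(word) %>% mutate(observed = n / sum(n)) # observed word shares
setNames(as.vector(theta %*% phi), vocab)            # expected word shares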

5.2.1.2 Topic Word Mixtures & Document Topic Mixtures Static, Document Length Varying

This next example is going to be very similar, but it now allows for varying document length. The length of each document is determined by a Poisson distribution with an average document length of 10.

Known values:

  • 2 topics : word distributions of each topic below
    • \(\phi_{1}\) = [ 📕 = 0.0, 📘 = 0.1, 📗 = 0.9 ]
    • \(\phi_{2}\) = [ 📕 = 0.4, 📘 = 0.4, 📗 = 0.2 ]
  • All Documents have same topic distribution:
    • \(\theta = [ topic \hspace{2mm} a = 0.5,\hspace{2mm} topic \hspace{2mm} b = 0.5 ]\)

Generative Model Pseudocode

  • For d = 1 to D where D is the number of documents
    • Determine length of document
    • \(W\) ~ Poisson(\(\xi\))
    • For w = 1 to W where W is the number of words in document d
      • Select the topic for word w
      • \(z_{i}\) ~ Multinomial(\(\theta_{d}\))
      • Select word based on topic z’s word distribution
      • \(w_{i}\) ~ Multinomial(\(\phi^{(z_{i})}\))
k <- 2 # number of topics
M <- 10 # let's create 10 documents
vocab <- c('\U1F4D8', '\U1F4D5', '\U1F4D7')
alphas <- rep(1,k) # topic document dirichlet parameters

phi <- matrix(c(0.1, 0, 0.9,
                0.4, 0.4, 0.2), 
              nrow = k, 
              ncol = length(vocab), 
              byrow = TRUE)

theta <- c(0.5, 0.5)
xi <- 10 # average document length 
N <- rpois(M, xi) #words in each document
ds <-tibble(doc_id = rep(0,sum(N)), 
            word   = rep('', sum(N)),
            topic  = rep(0, sum(N))
            ) 
            
row_index <- 1
for(m in 1:M){
  for(n in 1:N[m]){
    # sample topic index , i.e. select topic
    topic <- which(rmultinom(1,1,theta)==1)
    # sample word from topic
    new_word <- vocab[which(rmultinom(1,1,phi[topic, ])==1)]
    ds[row_index,] <- c(m,new_word, topic)
    row_index <- row_index + 1
  }
}

ds %>% group_by(doc_id) %>% summarise(
  tokens = paste(word, collapse = ' ')
) %>% kable()
doc_id tokens
1 📗 📘 📗 📘 📗 📗 📗 📕 📕 📗 📗 📘
10 📕 📕 📘 📗 📘 📗 📗 📗 📕 📗
2 📗 📗 📗 📗 📗 📕 📗 📘 📗 📕 📗 📘 📗
3 📕 📗 📗 📕 📘 📘 📗 📘 📕 📘 📕 📘 📕 📘 📘 📗
4 📗 📕 📕 📗 📘 📗 📘 📘 📘 📕 📗 📘 📗 📕 📘 📕
5 📗 📗 📗 📗 📗 📘 📕 📗 📗 📘 📕 📗 📗 📘 📗
6 📗 📘 📘 📗 📗 📗 📗 📘 📗 📗 📗 📗 📕 📗
7 📘 📘 📘 📗 📕 📗 📘 📘 📗 📘 📘 📗 📕 📘 📘 📗 📗 📕 📗 📗 📘 📗
8 📘 📗 📗 📕 📕 📗 📗 📕 📗 📗 📗 📕 📗 📗 📗 📕 📕 📕 📘
9 📗 📗 📗 📗 📗 📗 📗 📗 📗 📗 📗 📗 📗 📗
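
A small check on the new moving part, again reusing the objects defined above: the sampled document lengths should scatter around \(\xi = 10\).

N        # number of words drawn for each of the 10 documents
mean(N)  # should land in the neighborhood of xi
ds %>% count(doc_id) # the same lengths, recovered from the generated documents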

5.2.1.3 Topic Word Mixtures Static, Varying Document Topic Distributions and Document Length

So this time we will introduce documents with different topic distributions and lengths. The word distributions for each topic are still fixed.

Known values:

  • 2 topics : word distributions of each topic below
    • \(\phi_{1}\) = [ 📕 = 0.0, 📘 = 0.1, 📗 = 0.9 ]
    • \(\phi_{2}\) = [ 📕 = 0.4, 📘 = 0.4, 📗 = 0.2 ]

Generative Model Pseudocode

  • For d = 1 to D where number of documents is D
    • Sample parameters for document topic distribution
    • \(\theta_{d}\) ~ Dirichlet(\(\alpha\))
    • For w = 1 to W where W is the number of words in document d
      • Select the topic for word w
      • \(z_{i}\) ~ Multinomial(\(\theta_{d}\))
      • Select word based on topic z’s word distribution
      • \(w_{i}\) ~ Multinomial(\(\phi^{(z_{i})}\))
library(gtools) # provides rdirichlet(); MCMCpack has an equivalent

k <- 2 # number of topics
M <- 10 # let's create 10 documents
vocab <- c('\U1F4D8', '\U1F4D5', '\U1F4D7')
alphas <- rep(1,k) # topic document dirichlet parameters



phi <- matrix(c(0.1, 0, 0.9,
                0.4, 0.4, 0.2), 
              nrow = k, 
              ncol = length(vocab), 
              byrow = TRUE)


xi <- 10 # average document length 
N <- rpois(M, xi) #words in each document
ds <-tibble(doc_id = rep(0,sum(N)), 
            word   = rep('', sum(N)),
            topic  = rep(0, sum(N)), 
            theta_a = rep(0, sum(N)),
            theta_b = rep(0, sum(N))
            ) 
            
row_index <- 1
for(m in 1:M){
  theta <-  rdirichlet(1, alphas)
  
  for(n in 1:N[m]){
    # sample topic index , i.e. select topic
    topic <- which(rmultinom(1,1,theta)==1)
    # sample word from topic
    new_word <- vocab[which(rmultinom(1,1,phi[topic, ])==1)]
    ds[row_index,] <- c(m,new_word, topic,theta)
    row_index <- row_index + 1
  }
}

ds %>% group_by(doc_id) %>% summarise(
  tokens = paste(word, collapse = ' '), 
  topic_a = round(as.numeric(unique(theta_a)), 2), 
  topic_b = round(as.numeric(unique(theta_b)), 2) 
) %>% kable()
doc_id tokens topic_a topic_b
1 📕 📕 📗 📕 📗 📘 📗 📗 📗 📗 📗 0.08 0.92
10 📘 📕 📘 📘 📘 📘 📘 📘 📗 📘 0.11 0.89
2 📘 📗 📗 📗 📘 📕 📗 📕 📘 0.39 0.61
3 📗 📗 📗 📗 📗 📗 📘 📗 📘 📗 📗 0.43 0.57
4 📗 📘 📕 📕 📘 📗 📗 📕 📘 📕 📗 0.40 0.60
5 📗 📘 📘 📗 0.82 0.18
6 📘 📗 📘 📕 📗 📕 📘 📘 📕 📗 📗 📕 📗 📘 📕 0.11 0.89
7 📗 📗 📗 📗 📗 📘 📗 0.66 0.34
8 📘 📘 📗 📗 📗 📗 📗 📗 📗 📗 📗 📗 📗 0.68 0.32
9 📕 📘 📕 📘 📗 📘 📗 📕 📘 📗 📘 📕 0.01 0.99
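
Since \(\theta\) now varies by document, a quick check (again not part of the model itself) is to compare each document’s sampled theta_a with the share of its words that were actually drawn from topic 1; short documents will only match roughly.

ds %>% group_by(doc_id) %>% summarise(
  theta_a       = round(as.numeric(unique(theta_a)), 2),
  topic_1_share = round(mean(as.numeric(topic) == 1), 2)
)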

5.2.2 LDA Generative Model

We are finally at the full generative model for LDA. The word distributions for each topic are now drawn from a Dirichlet distribution, as are the topic distributions for each document, and the document length is drawn from a Poisson distribution.

Generative Model Pseudocode

  1. For k = 1 to K where K is the total number of topics
    • Sample parameters for word distribution of each topic
    • \(\phi^{(k)}\) ~ Dirichlet(\(\beta\))
  2. For d = 1 to D where number of documents is D
    • Sample parameters for document topic distribution
    • \(\theta_{d}\) ~ Dirichlet(\(\alpha\))
    • For w = 1 to W where W is the number of words in document d
      • Select the topic for word w
      • \(z_{i}\) ~ Multinomial(\(\theta_{d}\))
      • Select word based on topic z’s word distribution
      • \(w_{i}\) ~ Multinomial(\(\phi^{(z_{i})}\))
k <- 2 # number of topics
M <- 10 # let's create 10 documents
vocab <- c('\U1F4D8', '\U1F4D5', '\U1F4D7')
alphas <- rep(1,k) # topic document dirichlet parameters


betas <- rep(1,length(vocab)) # dirichlet parameters for topic word distributions
phi <- rdirichlet(k, betas)


xi <- 10 # average document length 
N <- rpois(M, xi) #words in each document
ds <-tibble(doc_id = rep(0,sum(N)), 
            word   = rep('', sum(N)),
            topic  = rep(0, sum(N)), 
            theta_a = rep(0, sum(N)),
            theta_b = rep(0, sum(N))
            ) 
            
row_index <- 1
for(m in 1:M){
  theta <-  rdirichlet(1, alphas)
  
  for(n in 1:N[m]){
    # sample topic index , i.e. select topic
    topic <- which(rmultinom(1,1,theta)==1)
    # sample word from topic
    new_word <- vocab[which(rmultinom(1,1,phi[topic, ])==1)]
    ds[row_index,] <- c(m,new_word, topic,theta)
    row_index <- row_index + 1
  }
}

ds %>% group_by(doc_id, topic) %>% summarise(
  tokens = paste(word, collapse = ' '), 
  topic_a = round(as.numeric(unique(theta_a)), 2), 
  topic_b = round(as.numeric(unique(theta_b)), 2) 
) %>% kable() 
doc_id topic tokens topic_a topic_b
1 1 📕 📘 📘 📕 📗 📘 📘 📕 📕 📗 📕 0.75 0.25
1 2 📕 📕 📘 0.75 0.25
10 2 📘 📘 📘 📘 📘 📘 0.08 0.92
2 1 📕 📕 📗 📕 📕 📗 📘 0.56 0.44
2 2 📕 📘 📗 0.56 0.44
3 2 📘 📗 📘 📗 📘 📘 📕 📕 📘 0.01 0.99
4 1 📕 0.13 0.87
4 2 📕 📘 📗 📗 📕 📕 📘 0.13 0.87
5 1 📗 📗 📗 📕 📘 📘 📕 📘 📕 📗 📘 📗 📘 📘 📗 0.87 0.13
5 2 📘 📕 0.87 0.13
6 1 📕 0.26 0.74
6 2 📘 📘 📕 📗 📘 📘 📕 📘 0.26 0.74
7 1 📘 📕 0.13 0.87
7 2 📘 📘 📗 📕 📘 📘 📗 0.13 0.87
8 1 📗 📘 📕 📕 📕 📗 📕 0.60 0.40
8 2 📕 📘 📘 📘 📕 0.60 0.40
9 1 📕 📘 📘 📗 📗 📕 📘 📘 📘 📘 📕 📗 0.77 0.23
9 2 📘 📗 0.77 0.23
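
Because \(\phi\) is itself sampled in the full model, it is worth printing the drawn topic word distributions and comparing them with the empirical word shares inside each topic (again just a check on the simulation, not part of LDA itself):

round(phi, 2) # sampled word distributions; rows are topics, columns follow `vocab`

ds %>% count(topic, word) %>% group_by(topic) %>%
  mutate(share = round(n / sum(n), 2)) # empirical word shares within each topic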

The LDA generative process for each document is summarized by the joint distribution below (Darling 2011):

\[ \begin{equation} p(w,z,\theta,\phi|\alpha, \beta) = p(\phi|\beta)p(\theta|\alpha)p(z|\theta)p(w|\phi_{z}) \tag{5.1} \end{equation} \]

You may be like me and have a hard time seeing how we get to the equation above and what it even means. If we look back at the pseudocode for the LDA model, it is a bit easier to see how we got here. We start by assigning each topic a probability for every word in the vocabulary, \(\phi\). This value is drawn from a Dirichlet distribution with the parameter \(\beta\), giving us our first term \(p(\phi|\beta)\). The next step is generating documents, which starts by drawing the topic mixture of the document, \(\theta_{d}\), from a Dirichlet distribution with the parameter \(\alpha\). This is our second term \(p(\theta|\alpha)\). The following two terms follow the same trend. The topic z of the next word is drawn from a multinomial distribution with the parameter \(\theta\). Once we know z, we use the distribution of words in topic z, \(\phi_{z}\), to determine the word that is generated.
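
If it helps, the same factorization can be written out with the products over topics, documents, and word positions made explicit (roughly following the notation in Darling 2011), where \(N_{d}\) is the number of words in document d:

\[ p(w,z,\theta,\phi|\alpha, \beta) = \prod_{k=1}^{K} p(\phi^{(k)}|\beta) \prod_{d=1}^{D} \left( p(\theta_{d}|\alpha) \prod_{n=1}^{N_{d}} p(z_{d,n}|\theta_{d}) \, p(w_{d,n}|\phi^{(z_{d,n})}) \right) \]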