Gensim is an easy to implement, fast, and efficient tool for topic modeling. The purpose of this post is to share a few of the things I’ve learned while trying to implement Latent Dirichlet Allocation (LDA) on different corpora of varying sizes. This post is not meant to be a full tutorial on LDA in Gensim, but a supplement to help you navigate around any issues you may run into. If you are getting started with Gensim, or just need a refresher, I would suggest taking a look at their excellent documentation and tutorials.
The Gensim Google Group is a great resource. Most of the information in this post was derived from searching through the group discussions. If you are having issues, I’d highly recommend searching the group before doing anything else. Also make sure to check out the FAQ and Recipes GitHub Wiki.
Logging Logging Logging
By default, you will not see anything printed to the screen while Gensim trains a model. Depending on corpus size, LDA may take a few minutes, hours, or even days, so it is extremely important to have some information about the progress of the procedure. Python’s logging module can be set up to send the logs either to an external file or to the terminal.
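As a minimal sketch of the setup described above, the snippet below attaches a handler to the "gensim" logger so that INFO-level progress messages become visible. The helper name enable_gensim_logging and the log format are my own choices for illustration, not something Gensim prescribes, and it assumes Gensim's modules log under names that start with "gensim".

```python
import logging


def enable_gensim_logging(log_file=None):
    """Send INFO-level Gensim log messages to the terminal or to a file.

    `enable_gensim_logging` is a hypothetical helper name used for this sketch.
    """
    # Write to a file when a path is given, otherwise to the terminal (stderr).
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s : %(levelname)s : %(message)s")
    )

    # Assumption: Gensim modules use loggers named "gensim.<module>",
    # so configuring the parent "gensim" logger covers all of them.
    logger = logging.getLogger("gensim")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


# Log to the terminal; pass a path such as "lda_training.log" to log to a file.
gensim_logger = enable_gensim_logging()
```

Call this once before training so that progress messages show up from the very first chunk.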
Building the dictionary and corpus
One of the primary strengths of Gensim is that it doesn’t require the entire corpus to be loaded into memory. You can also build a dictionary without loading all your data into memory. All of this is summarised in the Corpora and Vector Spaces Tutorial. I don’t have much to add here except the following: save and save_as_text are not interchangeable (this also goes for load and load_from_text).
What is the difference between save_as_text and save?
save_as_text is meant for human inspection, while save is the preferred method of saving objects in Gensim. The same applies to load_from_text and load. There is one additional caveat: some Dictionary functionality, such as filter_extremes and num_docs, will not work with objects that were saved to and loaded from text. In short, if you use save/load you will be able to process the dictionary at a later time, but this is not true with save_as_text/load_from_text.
Gensim can only do so much to limit the amount of memory used by your analysis. Your program may take an extended amount of time or possibly crash if you do not take into account the amount of memory the program will consume. Prior to training your model you can get a ballpark estimate of memory use by using the following formula:
8 bytes * num_terms * num_topics * 3
- 8 bytes: size of double precision float
- num_terms: number of terms in the dictionary
- num_topics: number of topics
- The magic number 3: The 8 bytes * num_terms * num_topics accounts for the model output, but Gensim will need to make temporary copies while modeling. The scaling factor of 3 gives you an idea of how much memory Gensim will be consuming while running with the temporary copies present.
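The formula above is easy to wrap in a small helper. This is only a sketch of the ballpark arithmetic from this post; the function name and the example dictionary size and topic count are my own.

```python
def estimate_lda_memory_bytes(num_terms, num_topics, bytes_per_value=8, scaling_factor=3):
    """Ballpark memory estimate: 8 bytes * num_terms * num_topics * 3."""
    return bytes_per_value * num_terms * num_topics * scaling_factor


# Example values chosen for illustration: 100k terms and 100 topics.
estimate = estimate_lda_memory_bytes(num_terms=100_000, num_topics=100)
print(f"Estimated memory: {estimate / 1e9:.2f} GB")  # prints "Estimated memory: 0.24 GB"
```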
There are multiple filtering methods available in Gensim that can cut down the number of terms in your dictionary. If you are unsure of how many terms your dictionary contains you can take a look at it by printing the dictionary object after it is created/loaded.
Most of the Gensim documentation shows 100k terms as the suggested maximum number of terms; it is also the default value for the keep_n argument of filter_extremes. The other options for decreasing memory usage are limiting the number of topics or getting more RAM.
Filtering a Previously Saved Dictionary and Updating the Corpus
If you need to filter your dictionary and update the corpus after the dictionary and corpus have been saved, take a look at the link below to avoid any issues:
I find it useful to save the complete, unfiltered dictionary and corpus; I can then use the steps in the previous link to try out several different filtering methods.
Training the LDA Model
If you follow the tutorials, the process of setting up LDA model training is fairly straightforward. The one thing that took me a bit to wrap my head around was the relationship between chunksize, passes, and update_every.
Chunksize, Passes, and Update_every
- passes: Number of passes through the entire corpus
- chunksize: Number of documents to load into memory at a time and process in the E step of EM.
- update_every: Number of chunks to process prior to moving on to the M step of EM.
I’m not going to go into the details of EM/Variational Bayes here, but if you are curious check out this Google forum post and the paper it references here. In general, a chunksize of 100k with update_every set to 1 is equivalent to a chunksize of 50k with update_every set to 2, because both perform an update after every 100k documents. The primary difference is that you will save some memory using the smaller chunksize, but you will be doing multiple loading/processing steps prior to moving on to the maximization step. Passes is not related to chunksize or update_every; it is simply the number of times you want to go through the entire corpus. Below are a few examples of different combinations of the 3 parameters and the number of online training updates which will occur while training LDA.
- chunksize = 100k, update_every = 1, corpus = 1M docs, passes = 1: 10 updates total
- chunksize = 50k, update_every = 2, corpus = 1M docs, passes = 1: 10 updates total
- chunksize = 100k, update_every = 1, corpus = 1M docs, passes = 2: 20 updates total
- chunksize = 100k, update_every = 1, corpus = 1M docs, passes = 4: 40 updates total
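The arithmetic behind those examples can be sketched as a short function. This follows the reasoning in this post rather than Gensim's internals: it assumes a single worker and one update after every chunksize * update_every documents, and the function name is my own.

```python
import math


def count_updates(num_docs, chunksize, update_every, passes):
    """Number of online updates, assuming one update per chunksize * update_every docs."""
    docs_per_update = chunksize * update_every
    # A final partial batch still triggers an update, hence the ceiling.
    updates_per_pass = math.ceil(num_docs / docs_per_update)
    return passes * updates_per_pass


print(count_updates(1_000_000, 100_000, 1, 1))  # 10
print(count_updates(1_000_000, 50_000, 2, 1))   # 10
print(count_updates(1_000_000, 100_000, 1, 2))  # 20
print(count_updates(1_000_000, 100_000, 1, 4))  # 40
```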
Additional considerations for LdaMulticore
If you are going to use LdaMulticore, the multicore version of LDA, be aware of the limitations of Python’s multiprocessing library, which Gensim relies on. Again, this goes back to being aware of your memory usage. If the following is true, you may run into problems:
8 bytes * num_terms * num_topics >= 1GB
The only way to get around this is to limit the number of topics or terms.
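A quick check along these lines can be run before committing to LdaMulticore. It is a sketch of the rule of thumb above; the function names are my own, and I am assuming "1GB" means 10**9 bytes.

```python
ONE_GB = 10**9  # assumption: "1GB" taken as 10**9 bytes


def model_matrix_bytes(num_terms, num_topics, bytes_per_value=8):
    """Size of the model output: 8 bytes * num_terms * num_topics."""
    return bytes_per_value * num_terms * num_topics


def may_exceed_multicore_limit(num_terms, num_topics):
    """True when 8 bytes * num_terms * num_topics >= 1GB."""
    return model_matrix_bytes(num_terms, num_topics) >= ONE_GB


# Example values chosen for illustration.
print(may_exceed_multicore_limit(num_terms=100_000, num_topics=100))    # False (80 MB)
print(may_exceed_multicore_limit(num_terms=100_000, num_topics=2_000))  # True (1.6 GB)
```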
Hopefully this post will save you a few minutes if you run into any issues while training your Gensim LDA model. Please make sure to check out the links below for Gensim news, documentation, tutorials, and troubleshooting resources: