The next step is to define the “parameters” of your model. These choices directly shape your results. The first parameter is the number of topics, which we call “k.” Why do we need to specify the number of topics in advance? Because the algorithm needs to know how many bins to use when modeling semantic relationships. This is one of the many reasons topic modeling is so subjective. What is the optimal k? Current research suggests that, depending on your documents, it can fall within a very wide range. There are tools that estimate the best k, but given how wide that range is, the most reliable approach is to run the model at several values of k and inspect the results. Start with 10 and increase in increments of 10 up to 100, then review each output. You will be able to tell when your topics begin to get unstable and random (your k is too high) and when your topics are too general (your k is too low). Again, no right answers here.
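A sweep over candidate values of k can be sketched as follows. This is illustrative, not a prescription: it assumes you have the topicmodels package installed and a document-term matrix named corpus2 (the same object used in the LDA call later in this section), and each fit can take a long time to run.

```r
# Sketch: fit a model at several values of k and print the top terms
# for each, so you can eyeball where topics become too general (low k)
# or too noisy (high k). Assumes topicmodels and a DTM `corpus2`.
library(topicmodels)

for (k in seq(10, 100, by = 10)) {
  fit <- LDA(corpus2, k = k, method = "Gibbs",
             control = list(seed = 2, iter = 1000, burnin = 20))
  cat("k =", k, "\n")
  print(terms(fit, 10))  # top 10 terms per topic
}
```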

For now, we will set k to 20:

k <- 20

The next parameter is “alpha.” Alpha determines whether you are looking for topics that are more unique to individual documents or topics that are more evenly distributed across the entire corpus. A low alpha will give you very distinct topics (each document will tend to have one strong topic, with the rest negligible), while a high alpha will give you documents associated with several topics in roughly equal measure.

According to the documentation that comes with the topicmodels library, 50/k is a recommended value. However, I urge you to play with these parameters and see what happens.
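If you want to see the effect of alpha directly, one way (a sketch, again assuming the topicmodels package and the document-term matrix corpus2) is to compare the per-document topic proportions of a low-alpha run against a high-alpha run:

```r
# Sketch: with a low alpha, each document's largest topic proportion
# should be close to 1; with a high alpha, topic mass is spread out.
# Assumes topicmodels and a DTM `corpus2`.
library(topicmodels)

k <- 20
fit_low  <- LDA(corpus2, k = k, method = "Gibbs",
                control = list(alpha = 0.1, seed = 2, iter = 1000))
fit_high <- LDA(corpus2, k = k, method = "Gibbs",
                control = list(alpha = 50 / k, seed = 2, iter = 1000))

# posterior()$topics returns a documents-by-topics matrix of proportions.
summary(apply(posterior(fit_low)$topics, 1, max))
summary(apply(posterior(fit_high)$topics, 1, max))
```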

The final parameter is the “seed.” Topic modeling is probabilistic, meaning the algorithm starts at a random location and iterates from there. Every run will therefore be different, because each starts from a different point. Setting the seed means that the next person who runs your model will get exactly the same result as you. Let me say it again in larger font:

Always set your random seed! This allows your model to be reproducible by anyone else (assuming you have shared your underlying data and code of course!).
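To see what the seed buys you, here is a minimal, self-contained R illustration (not specific to topic modeling): two runs that set the same seed produce identical random draws.

```r
# Same seed, same sequence of random numbers.
set.seed(2)
a <- sample(1:100, 5)

set.seed(2)
b <- sample(1:100, 5)

identical(a, b)  # TRUE
```

This is exactly why a seeded model run can be reproduced by anyone with your data and code.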

I will discuss in the next section how we can assess model stability, i.e., how much the topics change from one run to the next. For now, let’s get a single output, inspect it, and make sure we can reproduce it. Here we set the seed to 2 (pick your favorite number; mine is Platonic).

So we define our controls:

control_LDA_Gibbs <- list(alpha = 50/k, estimate.beta = TRUE, iter = 1000, burnin = 20, best = TRUE, seed = 2)

Then we run our model with the following function:

topicmodel <- LDA(corpus2, method = "Gibbs", k = k, control = control_LDA_Gibbs)

This can take a while, so be patient. On my machine, it took well over an hour (I forgot to measure it exactly, sorry). For this reason, if you just want to inspect the output of a model without taking the time to run it, you can load the following workspace: “00_Fish_TopicModel_Novel150.RData.”
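Once the model exists, whether from your own run or from the shared workspace, a quick way to inspect it is with the terms() and topics() functions from topicmodels (this assumes the fitted object in the workspace is named topicmodel, matching the code above):

```r
library(topicmodels)

# Load the pre-computed model instead of re-running LDA.
load("00_Fish_TopicModel_Novel150.RData")

terms(topicmodel, 10)     # top 10 terms for each of the 20 topics
head(topics(topicmodel))  # most likely topic for the first few documents
```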