Topic Stability

A final diagnostic I would propose is “topic stability.” As I have mentioned, topic modeling depends on a random procedure, meaning every time you run it you will get slightly different topics. That’s why it is so important to set your random seed in the code above. That way someone else can produce exactly the same model as you.

But this doesn’t address the question of whether the differences between models are significant. I.e. if I do not set the seed, how much variation will I see between topics? Chances are you will see a lot of variation, which is yet another reason to be hesitant about using topic models in an explanatory setting.

What can we do about this? Well, the best suggestion I have right now is to assess how much variability your topic of interest has relative to other topics in the model. While there is no way of comparing stability across models (i.e. of saying that my model is more stable across multiple runs than a model of another set of documents), you can at least assess and rank which topics are more stable within a single model. As above, this can help you see how your topic is behaving with respect to other topics. It can help you see the limitations of the conclusions you can draw.

Let me go into a detailed case study to make this clearer. 

To begin, let me summarize what I see as the three basic challenges when topic modelling a corpus:

  • topic number — how to choose the best number of topics (“k”)
  • topic coherence — how to understand what is inside a topic
  • topic stability — how to understand how stable a topic is across multiple models of the same k

A lot of discussion around topic modelling revolves around #1, as in how many topics do you choose. Reviewing the literature, this is often boiled down to “domain expertise.” I know this data set and I feel this number of topics is the best representation. I find that a frustrating answer when I observe how different the topics are at different levels of k. What I think I see happening is a process of decreasing generality as certain topics splinter into more specific versions of themselves. But I’m not sure how to link that a) to a theory of “topics” and b) to some sort of replicable process (i.e. quantitative).

In the above section, you have seen how to address #2. Now I’m going to focus on #3: what happens when you run multiple models at the same k but use different seeds? As practitioners know, you always get slightly different topics. So far I don’t know of any literature that investigates that “slightly.” If you know of some, please post here.

The case study I’m going to use is a collection of 60k academic articles in the field of literary studies. Topic modeling seems very appropriate for this corpus. After running initial models (and cleaning my words, again and again and again) I saw very coherent, well differentiated topics at the 50 and 60 topic level. As a “domain expert” I have some confidence that these vocabulary distributions meaningfully correlate with my own understanding of important topics in the field. Academic articles feel “topical” in that they are oriented around clusters of terms or concepts so that modelling them in this way makes sense to me.

Since choosing a single model based on a single k is at this point still ultimately arbitrary, my goal here is to better understand the stakes of the variability that exists between different runs at the same k. My assumption is that as you increase your k, the overall variability between runs should increase. Your topics will be more specific and thus subject to more randomness depending on the starting point. That’s probably not that informative.

But if you want to zoom in and talk about a specific topic, this variability seems important to discuss.

Take a look at this example to see why I think this matters so much (and of course maybe it doesn’t):

Topic 21: cultural, culture, identity, social, discourse, political, within, power, politics, studies, community, practices, ways, self, forms, practice, terms, rather, difference, critical

If asked, I would label this topic the “cultural studies” topic. It seems to be about questions of social discourse as well as identity politics, both associated with the cultural turn. If you examine when this topic becomes prominent, the timeline also makes sense:

But notice how in the second run this topic has a very different semantic orientation.

Topic 44: cultural, culture, identity, national, studies, chinese, world, cultures, western, global, ethnic, within, community, nation, japanese, china, practices, different, states, american

Instead of being about culture, discourse, and identity, it appears to be more about culture in the sense of national communities. The temporal signature is roughly the same (a few years later, a less strong weight overall). But I now have at my disposal two different narratives about disciplinary change — one hinges on an idea of cultural studies as a more general project concerned with discourse, power and identity, while the other appears to be concerned with different national identities.

Here is a list of the top 10 journals associated with these two versions:

Run 1 (Topic 21, identity model) | Run 2 (Topic 44, nationalism model)
Cultural Critique | The Journal of American Folklore
Signs | boundary 2
Social Text | Social Text
boundary 2 | Cultural Critique
American Literary History | Signs
The Journal of American Folklore | American Literary History
Studies in American Indian Literatures | Studies in American Indian Literatures
Victorian Studies | Caribbean Quarterly
PMLA | Feminist Studies
Diacritics | Comparative Literature Studies

Here we see how the nationalism model favours journals like Folklore, Caribbean Quarterly, American Indian Literatures, and Comparative Literature versus PMLA, Diacritics, and Victorian Studies for the identity model.

If we look at word clouds of the titles of the top 50 articles associated with this topic, we see quite different emphases.

I guess the larger point is that these topics are very much related, but not identical. They emphasize different aspects of cultural studies. For very general questions like “when does this topic gain in currency” maybe this doesn’t matter much (maybe?). But if you are going to tell a story about “cultural studies” then it matters which version of cultural studies one looks at and also how stable that definition is.

In this next section I will give you code you can use to assess the “stability” of your topic across multiple runs. I will continue using the novel data we have used so far since it is relatively small and easy-ish to handle. If I continued using the data above, I’d be able to show you that the cultural studies topic was “middle” in terms of its stability. In other words, the variation was significant enough to mean that it wasn’t highly stable, but it wasn’t as bad as some other topics, which frankly would just appear as random sets of words on every run, with little or no similarity.

Turning to the novel data, I’m going to make the following assumptions. You now know how to run a topic model and you know how to generate the posterior probabilities of topics and words (i.e. the likelihood of a word being in a topic and a topic in a document). We’re going to use those outputs for this next section. So go ahead and run your model 21 times using the same k but a different seed each time. For every run, you want to save the topic-word probabilities (the variable topic_word_probs below) and write them to a .csv file, changing the file name for each run:

            write.csv(topic_word_probs, file="model_1.csv")

Choose the one you want to work with, i.e. imagine you were going to study this model in depth.

Name it: model_original.csv.
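
The run-and-save step might look like the sketch below. Note that `fit_topic_word_probs()` is a hypothetical stand-in that just returns a random topics-by-words probability matrix; you would replace its body with your own model call and posterior extraction. The toy vocabulary, the k of 5 and the temporary output folder are likewise assumptions made only so the example runs on its own.

```r
# Hypothetical stand-in for "fit a model with this seed and return the
# topic-word probabilities" (rows = topics, columns = words).
# Replace the body with your real model call.
fit_topic_word_probs <- function(seed, k = 5,
                                 vocab = c("house", "water", "lady",
                                           "lord", "family", "night")) {
  set.seed(seed)                        # a different seed gives a different run
  m <- matrix(rgamma(k * length(vocab), shape = 1), nrow = k)
  m <- m / rowSums(m)                   # each topic sums to 1
  dimnames(m) <- list(seq_len(k), vocab)
  as.data.frame(m)
}

out_dir <- file.path(tempdir(), "stability_models")   # assumed location
dir.create(out_dir, showWarnings = FALSE)

# Run the model 21 times at the same k, one seed per run
n_runs <- 21
for (s in seq_len(n_runs)) {
  topic_word_probs <- fit_topic_word_probs(seed = s)
  write.csv(topic_word_probs,
            file = file.path(out_dir, paste0("model_", s, ".csv")))
}

# Pick the run you want to study in depth and rename it
file.rename(file.path(out_dir, "model_1.csv"),
            file.path(out_dir, "model_original.csv"))

saved_files <- list.files(out_dir, pattern = "\\.csv$")
```

After this you have 20 comparison models plus the one you chose, which is why the loop further down compares against all files but one.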

What we are now going to do is estimate, for each topic in your chosen model, how similar that topic is on average to its most similar topic in all other models. In other words, we will go through every model, find the topic that looks most like it, calculate a “similarity” score, and then average that score across all models. The more stable a topic is, the closer it will be on average to its best match in each of the other models (with the divergence measure used below, that means a lower score). Then we can see which topics are very stable and which are very random.

The first thing you need to know is that topic numbers do not mean anything, so the words associated with Topic 1 in model 1 will not be associated with Topic 1 in model 2. However, the broad contours of Topic 1 in model 1 should have a “friend”, i.e. a similar topic, in model 2, but it could be any number. Thus we need to scan through all topics for every model to find the nearest topic. 

To calculate how similar two topics are to each other, we are going to use the measure of Kullback-Leibler divergence (KLD). KLD is great because it measures how much information is lost when one probability distribution is used to approximate another. And we are modeling our topics as exactly that: each topic is a vector of probabilities of words being included in that topic. So KLD calculates how much divergence, or information loss, there is when we approximate our target topic by every other possible topic.
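
To make the measure concrete, here is a minimal base-R sketch of the divergence between two topics and of picking the nearest topic in another model. `kl_div()` and `nearest_topic()` are illustrative helper names of my own, not functions from the entropy package, and the three-word topics are made up. The loop later in this section uses `KL.plugin` for the same job.

```r
# KL divergence D(p || q) = sum(p * log(p / q)), in nats.
# Both vectors are normalised first; terms with p == 0 contribute 0.
kl_div <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Return the column of `comp` (rows = words, columns = topics)
# that diverges least from `target`
nearest_topic <- function(target, comp) {
  scores <- apply(comp, 2, function(col) kl_div(target, col))
  list(topic = unname(which.min(scores)), kld = min(scores))
}

# Toy example: one target topic and a comparison model with three topics
target <- c(0.7, 0.2, 0.1)
comp <- cbind(c(0.1, 0.2, 0.7),
              c(0.6, 0.3, 0.1),
              c(0.3, 0.4, 0.3))

best <- nearest_topic(target, comp)   # picks the second topic
```

One thing to keep in mind is that KLD is not symmetric: `kl_div(p, q)` generally differs from `kl_div(q, p)`, so the scores here are always read as "target topic approximated by candidate topic."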

Since we are using KLD, we need to load the entropy library:


Then get a list of the models you want to test:




Then load your primary model, i.e. the one you want to use. Rows are words, columns are topics, values = probability of word being in topic.


Remove the first column


Transpose columns and rows


Rename columns
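
The preparation steps above can be sketched as follows. This is a self-contained illustration under stated assumptions: it first writes two toy models to a temporary folder so that it runs anywhere, and the folder, the toy vocabulary and the helper name `load_model()` are mine. With real data you would point `model_dir` at the folder holding your saved .csv files and call `library(entropy)` to get `KL.plugin`.

```r
model_dir <- file.path(tempdir(), "stability_setup")   # assumed location
dir.create(model_dir, showWarnings = FALSE)

# Toy stand-ins for saved models: 2 topics (rows) x 3 words (columns)
toy <- data.frame(house = c(0.5, 0.1),
                  water = c(0.3, 0.2),
                  lady  = c(0.2, 0.7))
write.csv(toy, file.path(model_dir, "model_1.csv"))
write.csv(toy[2:1, ], file.path(model_dir, "model_original.csv"))

# Get a list of the models you want to test
filenames <- list.files(model_dir, pattern = "\\.csv$", full.names = TRUE)

# Load a saved model and reshape it so rows are words and columns are topics
load_model <- function(path) {
  m <- read.csv(path)
  m <- m[, -1]                       # remove the first column (saved row names)
  m <- t(m)                          # transpose: words become rows
  colnames(m) <- seq_len(ncol(m))    # rename columns 1..k
  m
}

# Load your primary model
twp <- load_model(file.path(model_dir, "model_original.csv"))
```

Wrapping the three reshaping steps in one function means the comparison models inside the loop can be prepared in exactly the same way as the primary model.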


Then run a loop that goes through each model and identifies the nearest topic for that model and then provides the avg. KLD score for each topic in your primary model. 


#empty table to collect the results (assumed)

stable.df<-NULL

#run for every topic

for (i in 1:ncol(twp)){

  #subset by ith topic (assumed)

  sub1<-twp[,i]

  #go through each model and find most similar topic

  test.t<-NULL

  #run through all but final model, which is your original

  for (j in 1:(length(filenames)-1)){

    #load next model and reshape it like the primary model (assumed)

    comp<-read.csv(filenames[j])

    comp<-comp[,-1]

    comp<-t(comp)

    colnames(comp)<-seq_len(ncol(comp))

    #comp should now mirror twp

    #go through every topic in comp to find most similar topic in twp

    #calculate KLD for every topic pair with the ith topic from primary model

    kld.v<-vector()

    for (k in 1:ncol(comp)){

      kld.v[k]<-KL.plugin(sub1, comp[,k])

    }

    #find minimum value, i.e. most similar topic (which.min returns one index even on ties)

    top.t<-which.min(kld.v)

    #which model (assumed)

    model<-j

    #what was the divergence?

    kld.score<-kld.v[top.t]

    #create data frame

    temp.df<-data.frame(model, top.t, kld.score)

    test.t<-rbind(test.t, temp.df)

  }

  #calculate mean and sd for the ith topic compared to best topic of all other models (assumed)

  topic<-i

  mean.kld<-mean(test.t$kld.score)

  sd.kld<-sd(test.t$kld.score)

  temp.df<-data.frame(topic, mean.kld, sd.kld)

  stable.df<-rbind(stable.df, temp.df)

}


The output table here, stable.df, allows you to rank your topics by their stability. The lower the KLD, the more similar a topic is across all models. Thus we see that topic 2 (now, upon, like, long, water, seemed) is one of the most stable, while topic 18 is the least stable (said, sir, lady, lord). The family topic we were interested in before falls very much in the middle of things in terms of stability.
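
Ranking the table takes one line. The sketch below uses an invented `stable.df` with the same three columns as the loop's output; the numbers are placeholders to show the shape of the result, not values from the novel data.

```r
# Placeholder values only, to show the shape of the output table
stable.df <- data.frame(topic    = 1:4,
                        mean.kld = c(0.82, 0.31, 1.45, 0.97),
                        sd.kld   = c(0.10, 0.05, 0.40, 0.22))

# Lowest mean KLD first: the most stable topics end up at the top
ranked <- stable.df[order(stable.df$mean.kld), ]
ranked$rank <- seq_len(nrow(ranked))

most_stable  <- ranked$topic[1]
least_stable <- ranked$topic[nrow(ranked)]
```

Looking at `sd.kld` alongside the mean is worthwhile too: a topic with a low mean but a high standard deviation has a close match in some runs and nothing comparable in others.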

The major caveat here is that we don’t have a lot of texts here (150) and there is a lot of repetition since chunks come from the same books. These topics feel weak to me overall and thus their variability or stability should be taken with a grain of salt.

In the next section I will talk about some exploratory analytical things you can do with topic models.
