Topic modeling has become very popular because it often achieves a decent approximation of what we intuitively refer to as “topics.” Its fundamental limitation, however, is that it is a tool in search of a theoretical construct. What do I mean by this? If we return to the modeling section at the beginning of this book, I tried to show there how a good research process does not begin with a tool, but with a question and a theoretical construct of some kind. Today, I happen to be interested in studying narrative causality. “Narrative causality” is thus a theoretical construct that I would like to investigate in texts. I then need to build a measurable approximation of this idea and validate how well my measurement approximates it. There is no “tool” of narrative causality that allows me to study causality. One needs to be constructed to study this idea.
With topic modeling, we have done the reverse. Engineers created an algorithm that generates lists of words associated with each other according to certain quantitative criteria, and we then attached the theoretical construct called a “topic” to these lists of words. What do we mean by topics? How are LDA topics related to the long history of indexing words in documents (think of humanist commonplacing, which attached passages to keywords, much in the way we associate documents with topics)? Indeed, as I’ve explored in greater depth in Enumerations, we don’t yet really know what kind of linguistic object a topic is when produced by a topic model.[1] This is partially the fault of attaching a construct to a tool (instead of the other way around) and partially the fault of textual scholars who work with very loose concepts: what is the difference between topics, themes, and, say, “discourse”? It’s precisely these vagaries that need to be specified in advance of a research project, so that you can then construct a tool that approximates whatever definition you’ve come up with.
Add to all this that topic models can generate linguistic objects of very different natures: even when you use the same algorithm, different parameters produce markedly different outcomes. In other words, the idea of a “topic” is fluid when it comes to topic modeling, which is extremely problematic if you wish to make an argument about some historical or cultural process associated with topics.
To be more specific, one of the central issues surrounding topics as they are generated by LDA is that we haven’t defined the scale of generality to which topics refer. What do I mean by this? Well, a topic about love could consist of very high-level words like passion, affection, devotion, etc. (I just ripped these from a thesaurus, which I hope you can see raises a fundamental question: why not use a thesaurus?). Or it might consist of more concrete words associated with the places or actions of love, like bed, sofa, floor, letter, etc. If we want to talk about the topic of “love,” how can we specify which scale we are talking about, especially if a topic contains a mixture of these words? There are some tactics we can use to try to make topics more “specific” or “general,” but there is no guaranteed consistency in the semantic relations among topic words other than that they were generated by the same algorithm.
But isn’t the point to infer associations of words latent in the text? That’s why it’s called latent Dirichlet allocation! Yes indeed. The power of topic modeling is that it constructs topics that are specific to a set of documents: this is the context in which “love” is used for this set of documents. But that’s a different question from testing something about the behavior of the “topic of love” in a bunch of documents. (All this love talk is making me hungry…huh?) In other words, because we don’t define our construct in advance, we end up applying a concept after the fact to fit the data we are observing. This is problematic, to say the least.
Indeed, if you care about the semantic behavior of an individual keyword, like love, then it might be better to observe its behavior using something like the word embeddings discussed in subsequent sections. In other words, it is very important to have a handle on what you want to study before you study it. Is it a keyword like love whose semantic behavior, and potential change, you want to understand? Or is it a family of words associated with a looser concept like gender, where no individual word is all that important?
So don’t use it at all? No, not exactly. It is probably best to think of topic modeling not as a way to test “topics” in your documents, but as a way of generating insights about particular semantic behavior within them. This is a slight difference, but the key is to see the latter exercise as a form of “exploratory” data analysis rather than an “explanatory” one. Topic modeling can reveal patterns and initiate questions, but it is less appropriate for testing and confirming them. I will show examples of how topic modeling can be used to test hypotheses, but all this verbiage here ought to caution you against using it as your sole avenue to understanding or explaining your data. And finally, a lot depends on the latent “topicality” of your documents. News articles or academic articles, for example, are very “topical,” in the sense that they often concern a single unifying idea or issue. Things like novels, less so.
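One concrete form this exploratory stance can take is reading off each document’s topic mixture as a prompt for closer reading rather than as evidence. The sketch below assumes scikit-learn and a toy corpus of my own invention; the idea is simply that documents whose probability mass is split across topics are often the interesting cases, the ones that resist the model’s categories.

```python
# Exploratory use of LDA: inspect each document's topic mixture to
# surface candidates for closer reading, not to confirm a hypothesis.
# Toy corpus, illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "love letter passion devotion heart",
    "war battle soldier army victory",
    "soldier wrote a love letter home from the war",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row: topic proportions summing to ~1

# Print each document's mixture; evenly split rows are the ones
# worth going back and actually reading.
for i, dist in enumerate(doc_topics):
    print(i, dist.round(2))
```

Used this way, the model generates questions (“why does this document straddle two topics?”) instead of answers, which is the division of labor the paragraph above argues for.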
In what follows I will walk you through how to prepare your data for topic modeling, how to run a model, and how to assess your model and any given “topic” you may choose to analyze.