To see the words associated with each topic:
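A minimal sketch of this step, under stated assumptions: if the model was fit with the topicmodels package and stored in an object called `topicmodel` (a hypothetical name; only `term_dis` is named in this tutorial), the call would be along the lines of `term_dis <- terms(topicmodel, 20)`. The base R version below does the same ranking by hand on a toy topic-by-word probability matrix (`phi` and `top_words` are illustrative names, and the numbers are made up), so you can see what the ranking involves.

```r
# Toy topic-by-word probability matrix (made-up numbers for illustration).
# Each row is a topic, each column a word, and each row sums to 1.
phi <- matrix(
  c(0.40, 0.30, 0.15, 0.10, 0.05,
    0.05, 0.10, 0.15, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Topic 1", "Topic 2"),
                  c("father", "mother", "child", "will", "shall"))
)

# Return the n most probable words for each topic, one column per topic.
# top_words is a hypothetical helper, not part of any package.
top_words <- function(phi, n) {
  apply(phi, 1, function(p) {
    colnames(phi)[order(p, decreasing = TRUE)][seq_len(n)]
  })
}

term_dis <- top_words(phi, 3)
print(term_dis)

# With a fitted topicmodels object the equivalent call would be (assumption):
# term_dis <- terms(topicmodel, 20)
```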
This shows the top twenty words associated with each topic, ranked by their probability of belonging to that topic. To see all words in order of their probability of being associated with a topic, change the integer above:
Here we simply expand the number of words to include all of them (the same as the number of columns in our data frame). The reason to focus on the top 20 is that the probability of words further down the list is very small. How small? Here is a graph of the probability of words being in Topic 1. You can see that after the first few words, the probability of a word actually being associated with Topic 1 is very small.
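As a hedged sketch: assuming a topicmodels fit in a hypothetical object `topicmodel` and a document-term data frame in a hypothetical object `dtm`, asking for every word would look something like `terms(topicmodel, ncol(dtm))`. The same idea in base R, on a toy matrix with made-up numbers, is simply to set the number of words to the number of columns:

```r
# Toy topic-by-word probabilities (made-up numbers); rows are topics.
phi <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.10, 0.20, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Topic 1", "Topic 2"),
                  c("father", "mother", "will", "shall"))
)

# Rank every word in the vocabulary: n is just the number of columns.
n_words <- ncol(phi)
term_dis <- apply(phi, 1, function(p) {
  colnames(phi)[order(p, decreasing = TRUE)][seq_len(n_words)]
})

# One row per vocabulary word, one column per topic.
print(dim(term_dis))
print(term_dis)
```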
How did I generate this graph? I used the following code. To observe the probabilities associated with your words:
Select your topic:
Subset by your topic:
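The two steps above might look like the following. This is a sketch on a toy matrix: `phi` is a hypothetical stand-in for the topic-by-word probabilities (with topicmodels these are available as `posterior(topicmodel)$terms`, where `topicmodel` is an assumed object name). The row is deliberately kept as a one-row matrix, so that `t(prob_sample)` in the plotting code yields a single column, which `plot()` draws against each word's rank.

```r
# Toy topic-by-word probabilities (made-up numbers); rows are topics.
phi <- matrix(
  c(0.10, 0.40, 0.05, 0.30, 0.15,
    0.30, 0.05, 0.40, 0.10, 0.15),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Topic 1", "Topic 2"),
                  c("little", "father", "old", "mother", "child"))
)

# Select your topic
topic.no <- 1

# Subset by your topic; drop = FALSE keeps the one-row matrix shape
prob_sample <- phi[topic.no, , drop = FALSE]

# Sort the words from most to least probable
word_order <- order(prob_sample[1, ], decreasing = TRUE)
prob_sample <- prob_sample[, word_order, drop = FALSE]
print(prob_sample)
```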
Make a graph:
title.g<-paste("Word Probabilities\nTopic ", topic.no, sep="")
plot(t(prob_sample), main=title.g, xlab="Words", ylab="Probability")
Inspect the probabilities of the top 20 words for this topic:
As you can see, by the twentieth word you are getting very close to the elbow in the graph above.
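One way to do this inspection, sketched under the assumption that `prob_sample` is a one-row matrix already sorted from most to least probable. The values below are made up (a toy curve that decays with rank, not the novel data), and the word names are placeholders.

```r
# Toy sorted probabilities for one topic: 50 placeholder words whose
# probability falls off with rank (made-up numbers, normalized to sum to 1).
p <- 1 / (1:50)^2
prob_sample <- matrix(p / sum(p), nrow = 1,
                      dimnames = list("Topic 1", paste0("word", 1:50)))

# Probabilities of the top 20 words
top20 <- prob_sample[1, 1:20]
print(round(top20, 4))

# Share of the topic's total probability carried by those 20 words
print(sum(top20))
```

Comparing the first and twentieth values, or looking at the summed share, gives a quick numerical check on where the elbow sits.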
Let’s take a closer look at these lists of words. Remember, to do this in RStudio you just click on the variable in the “Environment” panel (“term_dis”).
The first thing you’ll notice is that each column represents a different topic. They aren’t named, but are simply numbered 1-20. The second thing you’ll notice is that some look distinctly like “topics,” while others definitely don’t. Topic 9 (father, mother, little, old, poor, child) looks like a nuclear family topic. What is Topic 16 (will, can, shall, may, must), other than a list of helping verbs? Is that a “topic”?
There are many things you can do to improve the “topicality” of your topics. The first is to add more data. Our 150-novel sample, while it results in ~18,000 chunks, is actually pretty small. Topic models work better the more data you give them because the word associations become clearer. And because the chunks come from the same novels, the model sometimes learns word associations that are specific to individual novels. A better approach might be to take a random sample of passages to avoid this problem (we’ll see this next).
Second, work on your vocabulary. There are still many words in here that I would consider cutting: “said,” “one,” and “ill” (the contraction I’ll with its apostrophe stripped), for example. Other researchers have also tried conditioning only on nouns as a way of profiling distinct topical realms. This would require running part-of-speech analysis, which we’ll learn next.
More advanced models also try to remove the “author” signature. If you have many books by the same author, the model might learn vocabulary distinctive of that author, which is not a good way to generalize.
Last, you might want to work on documents that are more topical! “Novels” are not necessarily “thematic,” or they are less thematically focused than other types of documents. This is actually an interesting area of study: if we talk about “topics” in novels, what do we actually mean? What are we trying to capture, and are topic models the best way to do this?
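The vocabulary pruning suggested above can be sketched as follows. This assumes the document-term data is a plain matrix or data frame of counts with words as column names; `dtm`, `extra_stop`, and `dtm_pruned` are hypothetical names, and the counts are made up. The model would then need to be refit on the pruned table.

```r
# Toy document-term matrix of raw counts (made-up numbers);
# rows are chunks, columns are words.
dtm <- matrix(
  c(3, 1, 0, 2, 4,
    2, 0, 0, 5, 1,
    0, 2, 3, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("chunk", 1:3),
                  c("said", "father", "mother", "one", "ill"))
)

# Extra words to cut, on top of whatever stopword list was already applied.
extra_stop <- c("said", "one", "ill")

# Keep only the columns whose word is not on the cut list.
dtm_pruned <- dtm[, !colnames(dtm) %in% extra_stop, drop = FALSE]

# Drop any chunk left with no words at all, since an empty
# document gives the model nothing to learn from.
dtm_pruned <- dtm_pruned[rowSums(dtm_pruned) > 0, , drop = FALSE]
print(dtm_pruned)
```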