So far we have been discussing the overall model and its fit with your documents. Often, however, researchers want to study the behavior of individual topics. The whole model is useful as a form of dimensionality reduction, but each topic contains a wealth of information in itself. The question, then, is what do you need to watch out for when making inferences from specific topics?
In this section I am going to offer a few diagnostic measures you can use to better understand any single topic and its behavior within your corpus. These steps are important to take before you conclude anything about your topic of interest because they help you contextualize the topic within your corpus and assess the degree of its stability. I’ll talk more about this idea in a moment, but essentially, if we know that topic modeling can produce different results with different random starting points, should we believe anything concluded from a single run of the model?
First, some more straightforward diagnostics, drawn once again from topic 9 (the so-called family topic).
Taking the measures in order, the first (# tokens) tells us how many word tokens overall are accounted for by the top twenty word types in your topic. This gives you a sense of how strong the topic is within the collection. Is it a dominant, middling, or weak topic in terms of its overall lexical presence? Before you make claims about how important a topic is, you need to know just how prevalent it is to begin with. To do this:
First, define your topic number:
topic.no<-9
Subset the overall DTM by the top 20 words from that topic and sum their overall frequency in the corpus.
tok.sub<-corpus2[,which(colnames(corpus2) %in% as.character(term_dis[,topic.no]))]
no.tokens<-sum(tok.sub)
The answer in this case is 233,712 tokens, which ranks second to last among all topics. In other words, this topic is one of the smallest in terms of the number of words it accounts for in the corpus.
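If you want to confirm where your topic ranks, here is a minimal sketch that repeats the same calculation for every topic (assuming, as above, that term_dis stores each topic’s top twenty words as a column):
topic.sizes<-sapply(1:ncol(term_dis), function(i){
  sum(corpus2[,which(colnames(corpus2) %in% as.character(term_dis[,i]))])
})
rank(topic.sizes)[topic.no]
A rank of 2 here corresponds to the second smallest topic in the model.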
The second measure (# passages) tells us how many documents have this topic present above some artificial probability threshold, that is, the number of passages where this topic is “strongly” present. The reason to calculate this is to assess how well distributed the topic is across the corpus.
First, subset the probability table by the topic. This gives you a two-column table where one column is the document title and the other is the probability of the topic being in that document:
doc.sub<-data.frame(row.names(topic_doc_probs), topic_doc_probs[,topic.no])
Then keep only documents where the topic is “strongly” present. For our purposes I am going to define “strong” as a probability two standard deviations above the mean probability for the entire model. The advantage of this method is that it does not condition on the single most important topic for a given document; instead it looks at documents where this topic is highly likely even if other topics are more likely still. The downside is that it is an arbitrary threshold pegged to the overall model. One could also create a cut-off for each topic, but I prefer using values relative to the whole model.
To remove documents that fall below the cut:
cut<-mean(as.matrix(topic_doc_probs))+(2*sd(as.matrix(topic_doc_probs)))
doc.sub<-doc.sub[which(doc.sub$topic_doc_probs...topic.no. > cut),]
Calculate the number of documents remaining:
no.docs<-nrow(doc.sub)
You should get 412, or roughly 2.2% of all documents.
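That percentage is simply the number of strongly associated passages divided by the total number of passages in the model (assuming, as above, that topic_doc_probs has one row per document):
no.docs/nrow(topic_doc_probs)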
Next we can calculate how concentrated these documents are within a single novel. That is, do most of them come from one novel, or are they well distributed across the corpus?
First, calculate how many novels the chunks come from. To do so, we are going to split up the filename, which contains the novel’s title as one of its underscore-separated fields.
library(splitstackshape)
nov.sub<-cSplit(doc.sub, "row.names.topic_doc_probs.", sep="_")
Then we will count how many unique titles there are.
no.novels<-nlevels(factor(nov.sub$row.names.topic_doc_probs._4))
Here we get 73, which ties for fifth lowest among all topics. This topic is not only small; it is also one of the least well represented across novels.
A second measure you can use is a concentration ratio. It simply asks how many of the documents that exhibit this topic come from the single most dominant novel associated with it. In this case the answer is 18.9%, meaning 78 of the 412 documents that exhibit this topic strongly come from the novel John Halifax, Gentleman (1856) by Dinah Craik. Not surprisingly, this makes it the second most concentrated topic in the model.
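The ratio can be computed from the split filenames in nov.sub (again assuming the fourth underscore-separated field holds the title):
nov.counts<-table(nov.sub$row.names.topic_doc_probs._4)
max(nov.counts)/no.docs
With the values above, this works out to 78/412, or roughly 18.9%.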
The average date (avg.date) gives us a sense of the temporal weight of the topic, while the standard deviation (sd.date) can help us see how broad the spread of dates in the topic is. A higher standard deviation suggests a greater temporal range for the topic. Here again we derive dates from the filenames, which contain this metadata.
avg.date<-round(mean(nov.sub$row.names.topic_doc_probs._2))
sd.date<-round(sd(nov.sub$row.names.topic_doc_probs._2))
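To see the full distribution of dates rather than just their mean and spread, you might also plot a quick histogram (this assumes, as above, that the second underscore-separated field of the filename is the publication date):
hist(as.numeric(nov.sub$row.names.topic_doc_probs._2), main="Topic 9: dates of strongly associated passages", xlab="Year")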
The final score is designed to give us some sense of the semantic coherence of the topic. “Coherence,” as defined by David Mimno et al., measures co-document frequency over document frequency for the top twenty words associated with the topic. In other words, you are examining how often two topic words appear together in the same document versus how often each appears in documents on its own. The more the topic’s words co-occur, the more “coherent” the topic is thought to be, and, as Mimno et al. show, the better the score correlates with expert judgments of a topic’s validity. “Coherence,” in short, measures the interwovenness of the topic among documents. In this case, topic 9 is the eighth least coherent topic.
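For reference, Mimno et al. (“Optimizing Semantic Coherence in Topic Models,” 2011) define the coherence of a topic’s top $M$ words $v_1, \ldots, v_M$ as

$$C(t) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m, v_l) + 1}{D(v_l)},$$

where $D(v)$ is the number of documents containing word $v$ and $D(v, v')$ is the number of documents containing both. The code below computes a close variant, summing the same log ratio over every ordered pair of co-occurring topic words.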
The code for this is a bit more involved. Here it is:
First transpose your DTM so words are rows.
tdm<-t(corpus2)
Then only keep documents with the topic strongly present.
tdm<-tdm[,colnames(tdm) %in% as.character(doc.sub$row.names.topic_doc_probs.)]
Only keep the top 20 words for that topic.
tdm<-tdm[row.names(tdm) %in% as.character(term_dis[,topic.no]),]
Now create a co-occurrence matrix for those words.
library(proxy)
russel.dist<-as.matrix(simil(tdm, method = "Russel", convert_distances = TRUE))
Transform proportions into raw counts (the Russel/Rao similarity is the number of documents two words share divided by the total number of documents, so multiplying by the document count recovers the co-document frequencies).
russel.final<-russel.dist*ncol(tdm)
Replace any NAs with 0s:
russel.final[is.na(russel.final)]<-0
Finally, create a loop that goes through each topic word and accumulates the coherence score.
coherence.total<-0
for (k in 1:nrow(tdm)) {
  # document frequency: the number of passages containing word k
  doc.freq<-length(which(tdm[k,] != 0))
  vec1<-0
  for (m in 1:nrow(russel.final)) {
    if (russel.final[k,m] != 0){
      # co-document frequency: the number of passages containing both word k and word m
      co.doc.freq<-as.integer(russel.final[k,m])
      # log of the smoothed co-document frequency over the document frequency
      coherence1<-log((co.doc.freq+1)/doc.freq)
      vec1<-vec1 + coherence1
    }
  }
  # add word k's contribution to the topic's overall score
  coherence.total<-coherence.total + vec1
}
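The more negative coherence.total is, the less coherent the topic. Claims like “topic 9 is the eighth least coherent topic” require running this for every topic. Here is a minimal sketch that consolidates the steps above into a reusable function, under the same assumptions about corpus2, topic_doc_probs, and term_dis:
coherence.for.topic<-function(topic.no){
  # keep only passages where the topic is strongly present
  cut<-mean(as.matrix(topic_doc_probs))+(2*sd(as.matrix(topic_doc_probs)))
  strong.docs<-row.names(topic_doc_probs)[topic_doc_probs[,topic.no] > cut]
  # subset the transposed DTM to those passages and the topic's top twenty words
  tdm<-t(corpus2)
  tdm<-tdm[,colnames(tdm) %in% strong.docs]
  tdm<-tdm[row.names(tdm) %in% as.character(term_dis[,topic.no]),]
  # co-document counts via the Russel/Rao similarity
  russel.final<-as.matrix(simil(tdm, method = "Russel", convert_distances = TRUE))*ncol(tdm)
  russel.final[is.na(russel.final)]<-0
  # accumulate the coherence score as above
  coherence.total<-0
  for (k in 1:nrow(tdm)) {
    doc.freq<-length(which(tdm[k,] != 0))
    for (m in 1:nrow(russel.final)) {
      if (russel.final[k,m] != 0){
        coherence.total<-coherence.total+log((as.integer(russel.final[k,m])+1)/doc.freq)
      }
    }
  }
  coherence.total
}
all.coherence<-sapply(1:ncol(term_dis), coherence.for.topic)
rank(all.coherence)[9] # topic 9's rank from least to most coherent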
Once you have done this, you have a set of diagnostics for understanding the role your topic plays in your corpus. Topic 9 is interesting from a conceptual point of view (“family”), but it is one of the smallest, least representative, and least semantically coherent topics in our model. These are important caveats to keep in mind when we set out to explain its importance or meaning for the data we’re exploring.