Code associated with this section: 02_HB_TopicModeling.R

Topic modeling is the colloquial term used for statistical models that try to estimate semantic associations between words (i.e. “topics”). The most popular model in use today is known as Latent Dirichlet Allocation (LDA), partly named after the German nineteenth-century mathematician Johann Peter Gustav Lejeune Dirichlet. (If you are a humanist then you’re probably interested in the historical context of data mining — where people like Dirichlet and Carl Friedrich Gauss, who came up with the mathematical formula for the “normal distribution” in statistics, are extremely important. Humanism is premised on the idea that historical context helps inform our understanding of the present. It’s worth noting that this isn’t necessarily true, but it is the basic epistemological heuristic underlying the humanities.)

LDA estimates the likelihood of words co-occurring within particular documents, assigning words to different groups (the topics) and then those topics to different documents. It is based on the assumption of distributional semantics discussed in the introduction to this section. The expectation of a topic model is that a word’s meaning is best represented by identifying those words that most often appear near it. A topic is defined as a generalized associational pattern of words.
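A minimal sketch of fitting such a model in R, assuming a document-term matrix `dtm` built with the tm package (the object name and the choice of k = 20 topics are illustrative, not prescribed by this section):

```r
library(topicmodels)

# LDA estimation is stochastic, so fix a seed for reproducibility
lda_model <- LDA(dtm, k = 20,
                 method = "Gibbs",
                 control = list(seed = 1234))

terms(lda_model, 10)   # the ten most probable words for each topic
topics(lda_model, 1)   # the single most likely topic for each document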

The results of topic modeling are, importantly, predictive rather than descriptive. The presence of a word within a topic or a topic within a document is based on a probability – it is likely that two words from a given topic will appear together, but this is not necessarily the case. This is what accounts for the model’s generalizability. Instead of representing the actual distributions of words in documents, topic modeling attempts to identify a more limited number of semantic fields that are likely to appear within them.
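These probabilities can be inspected directly. A sketch, assuming an LDA model object `lda_model` fit with the topicmodels package:

```r
library(topicmodels)

post <- posterior(lda_model)

# Word-topic probabilities: each row is a topic, each column a word,
# and each row sums to 1 (a probability distribution over the vocabulary)
post$terms

# Document-topic probabilities: each row is a document, each column a
# topic, and each row sums to 1
post$topics
```

Nothing here is deterministic: a word with high probability in a topic may still never appear alongside that topic’s other words in a given document.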

There are two primary ways that topic modeling can be useful. The first is as a form of dimensionality reduction. As I’ve mentioned already, having thousands of dimensions (i.e. variables or features) can be counter-productive for certain kinds of statistical inference. Topic modeling reduces your high-dimensional data down to a smaller number of dimensions, where the assumption is that those lower dimensions retain much (or most) of the underlying information contained in the higher dimensions.
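In practice the reduction is visible in the shapes of the matrices involved. A sketch, again assuming a document-term matrix `dtm` and a fitted topicmodels object `lda_model` with k = 20 topics:

```r
library(topicmodels)

doc_topics <- posterior(lda_model)$topics

dim(dtm)          # documents x thousands of word columns
dim(doc_topics)   # the same documents x only 20 topic columns
```

The twenty topic proportions per document can then stand in for the full vocabulary as features in downstream analysis.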

The second, and more common, way that topic modeling is useful is as a tool of semantic analysis. When you built a dictionary of “positive language” in the previous section you were in effect constructing a “topic”. Topic modeling does something similar, only instead of doing so independently of the data, it tries to infer “latent” semantic patterns in the data you are observing, which we in turn call topics.
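One way to see the parallel is to compare an inferred topic’s top words against a hand-built dictionary. A sketch, assuming a fitted topicmodels object `lda_model`; the `positive_words` vector here is a hypothetical stand-in for the dictionary from the previous section:

```r
library(topicmodels)

# Hypothetical hand-built dictionary (stand-in for the previous section's)
positive_words <- c("happy", "joy", "delight", "pleasure", "love")

# The 25 most probable words in each inferred topic
top_words <- terms(lda_model, 25)

# Count how many dictionary words appear among each topic's top words
apply(top_words, 2, function(w) sum(w %in% positive_words))
```

A topic whose top words overlap heavily with your dictionary is, in effect, the data-driven counterpart of the category you constructed by hand.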