To be honest, 3,592 variables is still a lot. And we know, for example, that there is a fair amount of redundancy in there. There are many words that belong to similar families or convey similar ideas. “Trees” and “shrubs” are both “plants.” There are many words that refer to body parts or particular types of action that could be grouped together. Indeed, this is one of the hardest things about text analysis: we generally want to make more general statements about what our documents are doing rather than focus on a single word’s behavior. But doing so isn’t straightforward. What we mean by a topic or concept or action is an ambiguous theoretical construct. How we operationalize those constructs impacts what we can know about our documents. The next two sections will give you ideas on how you can do this. The first uses pre-made dictionaries, while the second uses the approach of topic modeling. We’ll see later how you can also use machine learning to tackle this problem.
The simplest way to do this is to use custom lists of words (dictionaries) that approximate your idea or concept. I’ll talk at the end about the pros and cons of this approach, but for now let’s learn how to do it. You can either use your own list of words or existing lists that have been used in other experiments and, ideally, validated on human readers. The example I’m going to use here relates to what is known as sentiment analysis. While there are newer and more effective ways of doing sentiment analysis using machine learning, for starters we will use the dictionary approach.
Sentiment analysis refers to the study of a document’s positivity or negativity (what is called “valence”). It was initially invented to track consumers’ feelings about products (“this movie sucks!”). It has since been applied to the study of voter preferences and plot analysis in literature, among other things. Knowing whether characters or historical figures or books are associated with more positive or negative language can be a useful way of inferring qualities about them.
So how does it work?
The good news is there are several lists out there that are freely available for use. The other good news is you have already learned the syntax for subsetting your document term matrix by a custom dictionary, which is what we are going to do here.
Let’s use the Bing sentiment dictionary, which contains two lists: one of positive words and one of negative words. There are many dictionaries out there with differing levels of complexity. I like Bing for starters because it has a binary structure (either a word is positive or negative; you might call it ternary, because everything else is considered “neutral” by default). I’ve gone ahead and combined the positive and negative lists into a single .csv file called “Bing_Sentiment_All.csv.”
First, ingest the .csv:
> setwd("~/Data/Dictionaries")
> sent<-read.csv("Bing_Sentiment_All.csv", header=F)
Notice how it has two columns. The first represents the words and the second the valence (positive/negative). You can use other dictionaries that also try to label words by their “emotion,” which gives you even more to work with (and also more uncertainty!).
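If you want to confirm what you have before going further, it is worth peeking at the table. (Because we read the file with header=F, R names the columns V1 and V2 for us; the exact valence labels depend on how the file was built, so check them rather than assume.)
> head(sent)       # first few rows: words in V1, valence in V2
> table(sent$V2)   # how many words fall under each valence label?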
Next, subset your scaled DTM by this dictionary, i.e. only keep the columns that are in the dictionary’s list of words:
> dtm.sent<-dtm.scaled[, which(colnames(dtm.scaled) %in% as.character(sent$V1))]
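A quick sanity check shows how much the feature space has shrunk:
> dim(dtm.scaled)   # rows = documents, columns = all words
> dim(dtm.sent)     # same rows, but only dictionary words remain as columns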
This can be further subsetted by conditioning on either positive or negative words. In each case, you are going to collapse these multiple dimensions down into a single dimension called “sentiment,” which will represent either the sum of all so-called positive words in your documents, the sum of all negative words, or the sum of all positive and negative words together.
To compute the combined score, take the sum of each row of the subsetted table (row_sums() comes from the slam package, which tm’s document term matrices are built on):
> sent.score<-row_sums(dtm.sent)
What you’ve effectively done is create a new “feature” called “sentiment” and labeled all your documents with this feature. You have collapsed your high-dimensional feature space of all words, first into a lower-dimensional space of sentiment words, and then into an even lower-dimensional space consisting of a single feature called “sentiment.”
Let’s say you just want to keep the positive words, i.e. a subset of your dictionary. You would:
> pos.score<-row_sums(dtm.sent[,colnames(dtm.sent) %in% sent$V1[sent$V2 == "pos"]])
This says: take the row sums of the sentiment table, keeping only those column names that appear in the sentiment dictionary where the valence is “pos.” That is, you subset the sentiment dictionary using the last set of brackets and then subset dtm.sent using the first set. Brackets always tell you that you are taking a subset of something.
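The same logic gives you a negative score, and subtracting one from the other yields a simple net-sentiment measure. (The net score is my own extension here, and I am assuming the negative labels in the file read “neg”; adjust to whatever table(sent$V2) showed you.)
> neg.score<-row_sums(dtm.sent[,colnames(dtm.sent) %in% sent$V1[sent$V2 == "neg"]])
> net.score<-pos.score-neg.score   # higher values = more positive language overall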
The nice thing here is you can do this with any dictionary. You can create a list of pronouns, subset your document term matrix by these words, and aggregate their frequencies into a new feature called “pronouns” (see the sketch below). You could do it for “loudness,” as TK has done, scaling a list of verbs for their perceived loudness (where “shout” has a much higher value than “whisper”). For every feature you define, you can build a new table whose columns are these constructed features. Constructions all the way down.
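As a quick sketch of the pronoun example (the word list here is a deliberately short stand-in; for real work you would want a fuller, validated list):
> pronouns<-c("he","she","him","her","his","hers","they","them","their")
> pron.score<-row_sums(dtm.scaled[,colnames(dtm.scaled) %in% pronouns])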
In the section on hypothesis testing that you will see later, you will learn how you can use this method to answer questions like, Do women novelists in the nineteenth century use more positive language than men? Or, Do highbrow novels use more nostalgic vocabulary than popular fiction? One of the most rudimentary questions you can ask about your data is whether different types of texts (or authors) behave differently with respect to some dimension of language.
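As a small taste of what is coming, here is a minimal sketch, assuming a hypothetical vector author.gender with two labels aligned with the rows of your DTM (we will do this properly in the hypothesis testing section):
> t.test(sent.score ~ author.gender)   # do the two groups differ in mean sentiment?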
Before moving on to another method of dimensionality reduction it is worth pausing over the limitations of this method.
The limitations of using this approach are a) words often mean different things in different contexts and b) word lists are finite and may miss important aspects of your category that are not included in the lists. (You can think of two interlocking circles: the one on the left is all the words in your dictionary and the one on the right is all the words in your texts that belong to your category. Your dictionary will not capture every word your texts use to express sentiment, and not every occurrence of a dictionary word will carry the sentiment you ascribe to it.)
Slang is notoriously hard here since it evolves so quickly. Think of this as a set problem. Your list of words might undercount your category of interest because that concept is enacted by words not on your list. Or it might overcount your category because not all instances of a word in your dictionary are being used in the sense you mean. Like everything, these things boil down to using your judgment. How static is your category and how non-polysemous (monosemous?) are your words? Swear words generally mean “swearing” (whether positive or negative is a different story), but the set of possible swear words is large and always changing. Then again, it is often the case that a few words can account for a vast majority of all possible ways to express an idea.
In theory the list of words to express “perception” in novels is long, but in practice it boils down to a very small list of words that account for a vast amount of this action. We’ll see more complicated approaches you can use with machine learning. But for some questions creating lists of related terms can be a useful way of beginning to understand how linguistic behavior differs between different types of texts. You can also take the next step of testing which words in that category are doing more work in distinguishing groups from each other. So knowing one group swears more is one thing, but knowing that they do so using racist or misogynist swears is even more valuable knowledge. In our case above using sentiment analysis, which words account for the excess positivity you might be seeing in a category of documents? Burrowing down into your data is a great way to better understand what is going on. In the section on hypothesis testing I will cover this at greater length.
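As a starting point for that kind of burrowing, here is a minimal sketch that ranks the positive words by their total frequency across your documents (col_sums(), like row_sums(), comes from the slam package):
> pos.dtm<-dtm.sent[,colnames(dtm.sent) %in% sent$V1[sent$V2 == "pos"]]
> sort(col_sums(pos.dtm), decreasing=TRUE)[1:20]   # the twenty biggest contributors to positivity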