Associated code file for this section: 02_HB_FeatureSelection.R
There are two basic facts about words that impact your computational modeling of documents. The first is: there are a lot of them! As you’ve seen, with a collection of just 150 novels we have over 200,000 word types. In the social and medical sciences, when researchers look at relationships between variables, they typically consider a handful (like less than 10). But we have 200,000!
This problem is known as the curse of dimensionality. When you try to model relationships between documents, if you have too many dimensions it becomes harder and harder to discriminate between variables and their effects. Increasingly, techniques like machine learning don’t really care about understanding features and focus more on whether a model can predict some real-world outcome in a reliable way. The number and even the nature of the features matters less, though you can still run into the problem of over-fitting (i.e. learning something too specific to generalize from). I’ll come back to this issue later, but for now, having too many words can be a problem. It slows things down, adds noise to your analysis, and limits the generalizability of your findings.
The second fact is the one I already mentioned above: not all words are created equally. There is a tremendous difference between the relative frequencies with which words appear in documents. This will have a significant impact on how your models perform — because some words appear so much more often, they will provide much more information about your documents than others. If you care about those words then that’s great. If you don’t, then not great.
A foundational insight into language use was discovered by George Kingsley Zipf, chair of Harvard’s German Department prior to the Second World War. I’m a Germanist, so I think it’s cool that a Germanist played such an important role in the history of computational linguistics. Zipf’s law, as it came to be known, says that the frequency of a word in a corpus will be inversely proportional to its rank among words in that corpus. The simplest way to model this is as a 1/n function, where the nth most common word will occur 1/n times as often as the first. In other words, the second most common word will appear 1/2 as often as the first; the third will appear 1/3 as often as the first, etc. What this means in practice is that very few word types account for the vast majority of the word occurrences (or tokens) in your collection of documents. Here is a simple way to visualise this effect in your data.
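Below is a minimal sketch of one way to draw that curve. It assumes the raw-count document-term matrix from the previous section is stored in a variable called dtm (the name is an assumption on my part); col_sums comes from the slam library, which tm uses behind the scenes for its sparse matrices.
> library(tm)
> library(slam)
#Sort the total count of every word type from most to least frequent
> word.totals<-sort(col_sums(dtm), decreasing = T)
#Plot the sorted counts: word rank on the x-axis, raw count on the y-axis
> plot(word.totals, type = "l", xlab = "word rank", ylab = "raw count")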

What you see in this graph are words in descending order of their raw counts. The left side of the x-axis is the word that appears most frequently in the corpus (“the”), followed by the next most frequent word (“and”), etc. A careful observer will note that this pattern does not exactly follow Zipf’s law (“of,” the third-ranked word, is not 1/3 as frequent as “the”). Humans! This is why culture doesn’t obey laws but probabilities. Nevertheless, we can still see a massively steep decline in the rates at which words are used (which we would say roughly follows a “power law”). You don’t need to be a mathematician to see that not all words are behaving the same way in our corpus.
Why does Zipf’s law matter (even if it is not perfectly following the letter of the law)? Because words on the left side of your graph occur very frequently and words on the right very infrequently. The left side carries a lot more information from a mathematical point of view. However, it carries a lot less information from a semantic point of view. “The,” “and,” and “of,” are not the most interesting words (I mean sometimes they are, but that’s a different story). If your question is about what makes documents different at the level of content or theme, then these words won’t tell you what you want to know, but they will drown out every other word.
However — this section will just be a string of howevers — these left-sided words can be very informative for certain tasks, such as authorship attribution. It turns out that our individual writing styles can be detected quite accurately by focusing on just these high-frequency, semantically light words, which are what we are referring to when we use the term “stopwords.” Every person carries around little behavioral tics that come across in these little words we use all the time. This is what researchers in the field of stylometry refer to as “style.” Fascinating, right? You’re most you in your most vacuous linguistic state. Of course.
So given Zipf’s law you have a series of choices you can make based on your research goals.
Choice 1: Keep only stopwords
The first and easiest thing to do is to only keep stopwords. To do this you can either limit your words by an arbitrary list of words or by some mathematical cut-off. Caveat 1: There is no such thing as a definitive list of stopwords. Different libraries will provide lists of different lengths for different languages. In addition, different domains might have words that are stop-like but only for that domain. A classic example for literature is “said.” It is not on the usual list of stopwords, but it is incredibly frequent in works of fiction (for obvious reasons). It behaves like a stopword, but only in fiction. So each domain probably has lists of words that you ought to treat like stopwords, which is another reason that knowing your domain well matters.
The second approach is to keep a fixed number of words, either some round, arbitrary number (like 100, 250, 500) or one chosen with a statistical heuristic like an elbow test, which estimates where that curve you see above begins to “turn” and cuts off your words there (a rough sketch of one such heuristic follows below). This is a much better approach than picking a random threshold. It is probably also better than using a fixed list, though fixed lists are nice in that they are consistent across all experiments: the same list of words can be used on different data, testing the replicability of the finding, etc. For our purposes here, we’re going to use the list of stopwords included in the tm library.
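For the curious, here is a rough sketch of one such elbow heuristic (not necessarily the test used in the associated code file): it picks the rank whose point lies farthest from the straight line connecting the first and last points of the frequency curve. It again assumes a raw-count matrix called dtm, as in the plot above.
> freqs<-sort(col_sums(dtm), decreasing = T)
#Rescale rank and frequency onto the unit square
> x.n<-(seq_along(freqs) - 1)/(length(freqs) - 1)
> y.n<-(freqs - min(freqs))/(max(freqs) - min(freqs))
#The elbow is the point farthest from the diagonal line x + y = 1
> elbow<-which.max(abs(x.n + y.n - 1)/sqrt(2))
> freqs.top<-freqs[1:elbow]
Everything up to that elbow rank would then count as your high-frequency, stopword-like vocabulary.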
So how do we keep stopwords?
First, create a variable that includes your stopwords.
> stop<-stopwords("en")
Here I’m using the built-in list from the tm library. Notice how this list includes punctuation (the apostrophes in contractions like “isn’t”). So let’s remove that.
> stop<-gsub("[[:punct:]]", "", stop)
gsub is a nice function. It takes three arguments: first, the pattern to match (here, any punctuation); second, what should replace it (nothing); and third, the object in which to make the replacements (our variable called “stop”).
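For instance, a quick test on a single made-up string shows how the three arguments fit together:
> gsub("[[:punct:]]", "", "isn't")
This returns "isnt", the same string with its apostrophe stripped out.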
Now you want to remove all words except those words.
> dtm.stop<-dtm.scaled[ ,which(colnames(dtm.scaled) %in% stop)]
Notice the syntax here. I’m doing something to a portion of my DTM, so I use brackets. I want to focus on the columns, so I work after the comma. Things before the comma refer to rows, things after to columns. (If your data object is one-dimensional, i.e. doesn’t have rows and columns, then you just use an index number in brackets without the comma.) What am I doing to the columns? I’m keeping only the column names that are in the list of stopwords, i.e. that match exactly. This leaves me with 172 columns. This is a much smaller matrix!
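If you want to confirm the reduction for yourself, checking the dimensions is a quick sanity test (your exact column count may differ slightly depending on the stopword list your version of tm ships with):
> dim(dtm.stop)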
Choice 2: Remove stopwords
For most projects, this is the more common step. The beauty here is that we are just going to do the inverse of what we just did. So instead of keeping only the column names in the list of stopwords, let’s keep only the column names not in the list of stopwords. To negate a condition in R we use an !, but with %in% it weirdly goes at the very front of the expression, negating the whole TRUE/FALSE vector that %in% returns, rather than attaching to the operator itself (R!).
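If that placement seems odd, here is a quick illustration using two sample words:
> c("the", "whale") %in% stop
> !c("the", "whale") %in% stop
The first line returns TRUE FALSE (only “the” is in the stopword list); the second flips it to FALSE TRUE, which is exactly the pattern we need for keeping non-stopwords.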
> dtm.nostop<-dtm.scaled[ ,which(!colnames(dtm.scaled) %in% stop)]
Now we still have…a huge matrix! We only removed 172 columns. So there is still work to do.
Choice 3: Remove the long tail
The next step is to remove the long tail: all those words that occur very, very infrequently. Here again there are a couple of ways to go. You could simply keep the top X words, where X is a nice round number, like 1000, 3000, 10000, etc. This is often done in practice. Why? Well, look at the 10,001st word. (Notice how I’m now going to work with means rather than sums because I am using scaled frequencies. We want the average percentage at which a word occurs across the corpus.)
> top.words2<-sort(col_means(dtm.nostop), decreasing = T)
> top.words2[10001]
It’s “schlegel.” It occurs 86 times in a set of 150 novels, i.e. less than once per novel on average. After that comes “wench.” In other words, by the time we reach the 10,000th word we are encountering vocabulary that is very rare.
So to condition on the top 10K words:
> top.words2<-sort(col_means(dtm.nostop), decreasing = T)[1:10000]
> dtm.top10k<-dtm.scaled[,which(colnames(dtm.scaled) %in% names(top.words2))]
Another way to do this, which I prefer, is to keep only words that appear in a majority of your documents. In other words, rather than conditioning on word counts, which can be biased toward words that appear a lot in one or two novels but nowhere else (like “schlegel”), you keep only those words that are common across your corpus, where “common” means appearing in a majority of documents.
The request here is simple:
> dtm.sparse<-removeSparseTerms(dtm.scaled, 0.4)
The catch here is to pay attention to that decimal number (0.4). It sets the maximum sparsity a word is allowed to have, where sparsity is the percentage of documents in which the word does not appear. With 0.4, a word is removed if it is missing from more than 40% of all documents. Another way of saying this is that we are keeping only words that appear in more than 60% of all documents. So if you want to keep all words that appear in more than 5% of your documents you would set the decimal to 0.95, and if you want only words that appear in more than 95% of all documents you would set it to 0.05.
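If the direction of that threshold is hard to keep straight, a toy example with three tiny made-up documents makes it concrete. Here “apple” is missing from only one of the three documents (a sparsity of 1/3, under the 0.4 limit), so it survives; “zebra” and “pear” are each missing from two (a sparsity of 2/3), so they are dropped.
> toy<-DocumentTermMatrix(VCorpus(VectorSource(c("apple zebra", "apple", "pear"))))
#Only "apple" should remain after applying the 0.4 threshold
> inspect(removeSparseTerms(toy, 0.4))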
So now you have a matrix that has only 3,761 columns and no stopwords. What are the last five words and the top five words now?
Here’s how to do that:
> sort(col_means(dtm.sparse), decreasing = T)[1:5]
> sort(col_means(dtm.sparse), decreasing = T)[(length(colnames(dtm.sparse))- 4):length(colnames(dtm.sparse))]
You should get:
#First five
said 0.004348011
one 0.003127073
will 0.002224044
now 0.002142355
little 0.001835339
#Last five
quicker 0.000010368207
xi 0.000009448406
xii 0.000009130416
xiv 0.000008027988
xiii 0.000007774898
Oops. We probably don’t want chapter headings in our analysis. We also might not want “said” or even “one” (maybe “will”? you see how this is a slippery slope). This is the single most important rule when working with data:
always inspect your data!
When you do so, you see you still have more work to do. For now we’re actually going to go backwards and remove all words with fewer than three letters, as well as a very short custom list of words.
Remember we created a table that contained no stopwords. Let’s start there and remove any columns whose names are fewer than three letters long:
> dtm.nostop<-dtm.nostop[, which(!nchar(colnames(dtm.nostop)) < 3)]
This removes 420 words.
Next, let’s remove a custom list of words.
> stop.xtra<-c("said", "one", "will")
Let’s also append a list of roman numerals to that list.
> stop.xtra<-append(stop.xtra, tolower(as.roman(1:1000)))
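A quick peek at the first few entries shows what we just appended:
> tolower(as.roman(1:5))
This gives "i" "ii" "iii" "iv" "v", and so on all the way up to "m" for 1000.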
Now remove those words from the DTM:
> dtm.nostop<-dtm.nostop[, which(!colnames(dtm.nostop) %in% stop.xtra)]
Rerun remove sparse terms:
> dtm.sparse<-removeSparseTerms(dtm.nostop, 0.4)
And voilà! You should have a table with 3,592 columns. The lowest occurring words are: asserted, sets, and quicker. The important point here is that your feature space now represents the most common words across your corpus, excluding stopwords, words under three letters, roman numerals, and three custom words we consider stop-like.
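If you want to double-check those numbers against your own run, the same pattern we used above works here (your exact counts may differ slightly if your stopword list or corpus differs):
> length(colnames(dtm.sparse))
> sort(col_means(dtm.sparse), decreasing = T)[(length(colnames(dtm.sparse)) - 2):length(colnames(dtm.sparse))]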