There are two salient facts about text data that impact this next step. The first is that your documents are very often of wildly different lengths. Some of our novels are six times longer than others. So if you are using word frequencies, those frequencies will look different simply because of the documents’ lengths. I might use twice as many “toes” as you, but if my document is four times longer than yours, then you’re the one with the toe fetish.
The simplest solution is to scale the “raw” counts by the total number of words in each document, effectively turning the values into proportions (relative frequencies). This makes all the values commensurate with each other because they now represent how often a word occurs relative to the total length of the document. To do this, you type:
> dtm.scaled<-corpus1.dtm/row_sums(corpus1.dtm)
What we are doing is dividing the entire table (corpus1.dtm) by the total word count of each novel (row_sums). You now have a new table called dtm.scaled where the values are no longer raw counts, but proportions. As you’ll see, they are very, very small (because you are dividing a very small number by a very large number). To inspect the first 10 values you can type:
> dtm.scaled$v[1:10]
[1] 0.00002741905 0.00002741905 0.00008225714 0.00008225714 0.00005483809
[6] 0.00002741905 0.00016451427 0.00002741905 0.00008225714 0.00002741905
Why are there some repeating very small values? Because those are likely all 1 counts, i.e. words that appeared only once in a document. Let’s check. We know The Man of Feeling is the first novel and we know that its word count is:
> row_sums(corpus1.dtm)[1]
So if you divide 1/row_sums(corpus1.dtm)[1] you do indeed get 0.00002741905, which says that whatever this word is, it appears 0.0027% of the time in The Man of Feeling, or just under three times every 100,000 words. That’s very infrequent!
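If you want to verify this yourself, here is a quick sketch (the object names beyond those above are my own, and it assumes corpus1.dtm is a document term matrix built with the tm package, as before):
> first.novel<-as.matrix(corpus1.dtm[1,])   # the first novel as an ordinary matrix
> hapax<-names(which(first.novel[1,]==1))   # every word that appears exactly once in it
> length(hapax)   # how many such words there are
> head(hapax)   # a sample of them
> row_sums(dtm.scaled)[1]   # each row of the scaled table should now sum to (almost exactly) 1
Every one of those once-only words will show up in dtm.scaled as that same tiny value, 1 divided by the length of The Man of Feeling.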
The second problem you will encounter working with text data is that word frequencies can vary considerably from word to word. Not only are documents not all created equally, neither are words. Some words, like “the,” appear a lot. Some, like “sublimation,” less so. Indeed, 0 values (and 1s) are very, very common when it comes to text data. A table where most of the cells are empty is what we call a sparse matrix (i.e. not very well filled in). I will go into greater detail in the next section on why this is the case and the best ways to handle it, which usually involve removing words (ack, you’re taking stuff out!? and I thought “normalization” was bad). Yep, we throw stuff out all the time. Which is really important to think about when you are modeling the world. What went missing and why/how does it matter? Your map can never be the same size as the world. Good models are parsimonious models.
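If you are curious just how sparse your own matrix is, a quick back-of-the-envelope check like this one will tell you (a sketch using my own object names; it relies on the fact that the slam package behind tm’s document term matrices only stores the non-zero entries):
> total.cells<-nrow(corpus1.dtm)*ncol(corpus1.dtm)   # every possible novel/word pairing
> filled.cells<-length(corpus1.dtm$v)   # only the non-zero cells are actually stored
> 1-(filled.cells/total.cells)   # the proportion of the table that is 0
> sum(corpus1.dtm$v==1)/filled.cells   # and how many of the filled cells are just 1s
The closer that first number is to 1, the sparser your matrix.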
In the meantime you should know about at least one of the more straightforward transformations that can be applied to handle this problem. The first is called tf-idf, for term frequency * inverse document frequency. What does this string of meaningless letters mean? It means we divide the raw counts of words not by the total number of words in each document (as we just did), but by the number of documents they appear in. Think about it for a second. Let’s say you are a very frequent word. Your count is high. You look like you matter more than the other words because you’re so big. But then tf-idf comes along and says, yes, but you’re everywhere. You’re not actually that interesting and you don’t really help me distinguish between these documents and their potential meanings very well. So if I take that big number and divide it by an equally big number (the number of documents you appear in), you will get smaller. Conversely, a word that appears less often, but only in a few documents, will grow in importance. We will be dividing a small number by a small number, and thus that number will appear relatively large.
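To make that arithmetic concrete, here is a hand-rolled sketch of the idea (my own object names; the standard recipe multiplies the term frequency by the log of the inverse document frequency, roughly what the tm function used below does as well, though the details can vary):
> m<-as.matrix(corpus1.dtm)   # an ordinary dense matrix: rows are novels, columns are words
> tf<-m/rowSums(m)   # term frequency, i.e. the same scaling we did above
> df<-colSums(m > 0)   # document frequency: how many novels each word appears in
> idf<-log2(nrow(m)/df)   # inverse document frequency: 0 for a word found in every novel
> tfidf.manual<-sweep(tf, 2, idf, "*")   # discount the ubiquitous words, boost the rarer ones
Notice that a word appearing in every single novel gets an idf of 0 and simply drops out: the ultimate version of “you’re everywhere, so you don’t help me tell these documents apart.”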
The logic here is that words that are frequent in some documents but less widely used across the corpus are more meaningful than words that appear a lot in every document. That’s a big assumption. In practice, some have recommended against using it, depending on the nature of your documents. It was once thought to be the bread-and-butter transformation, but it is less widely used these days. I have found, for example, that it works better for poetry than prose: with poetry you want to condition on rarer words that distinguish a document from its peers, because poetry is by nature more of a keyword-driven genre (but that is also a very large hypothesis…). Like everything in this field, this still needs more systematic study.
If you did want to transform by tf-idf, you use this simple request:
> dtm.tfidf<-weightTfIdf(corpus1.dtm, normalize = TRUE)
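If you are curious what that weighting actually did, you can peek at the words tf-idf now considers most distinctive of the first novel (again, a sketch using my own object names):
> first.tfidf<-as.matrix(dtm.tfidf[1,])   # the first novel's row of weighted values
> sort(first.tfidf[1,], decreasing=TRUE)[1:10]   # its ten most heavily weighted words
These should be words that appear often in The Man of Feeling but in few of the other novels.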
And that’s it. You now have a document term matrix where your texts have been normalized and your quantities have been normalized. If you want to get to the next step, i.e. beginning to explore your data and the relationships between your texts, skip the next section. If you want to learn another technique for ingesting your texts one at a time so you can calculate some measure on each of them, keep reading.