Associated code file for this section: 01_HB_PreparingYourData_1by1.R
For much of what we will be doing in this book, having a document-term matrix is the starting point, and thus the tm package is a great way to go. It is a simple and efficient tool for building a matrix of word counts for your documents, and it lets you run straightforward transformations on those tables.
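If you want to jog your memory, a minimal sketch of that tm workflow might look something like the following (it assumes the texts sit in the txtlab_Novel150_English folder used below; adjust the path for your own machine):

library(tm)
#build a corpus from all of the plain-text files in the folder
corpus<-VCorpus(DirSource("txtlab_Novel150_English", encoding="UTF-8"))
#apply the familiar cleaning steps: lowercase, remove numbers, remove punctuation
corpus<-tm_map(corpus, content_transformer(tolower))
corpus<-tm_map(corpus, removeNumbers)
corpus<-tm_map(corpus, removePunctuation)
#build the document-term matrix of word counts
dtm<-DocumentTermMatrix(corpus)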
However, there might be times when you want to calculate things on individual texts and aggregate those calculations as new features of your texts. The type-token ratio I discussed above is a perfect example. While you could get a global type-token ratio for each document using the document-term matrix, this isn't a good idea. Why? Because, remember, TTR is very sensitive to document length, and we know that our documents differ considerably in length. So we need another way to calculate it, one that operates on randomly sampled text segments rather than ingesting the whole document as a single "bag" (ugh).
So I'm going to provide you with the basics of reading in documents one by one. The usual procedure is to read in a document, clean it, perform some function on it (i.e. calculate something about it), and then output that calculation to a new table. That new table is now the feature space of your documents. Instead of just having "words" as features/variables, you now have these potentially more complex features. I say potentially because words are actually quite complex, especially when you look at them together.
OK, to get started, we first need to get a list of the files in the folder where your texts are stored.
> f.names<-list.files("txtlab_Novel150_English")
Then, change your working directory to that folder. You can use the drop-down menus in RStudio or type something like:
> setwd("~/Data/txtlab_Novel150_English")
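It never hurts to check that R actually found your texts before moving on. Something like the following shows how many files were picked up and what the first few are called (the exact count and names will depend on your corpus):

> length(f.names)
> f.names[1:5]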
The first thing we are going to do is create a function that reads in the texts and cleans them. Functions are a great way of simplifying your code because you can call the same function over and over again (and make sure that you are doing the same thing each time). Notice how you're doing the exact same things to the document as in the previous section (removing numbers, removing punctuation, and lowercasing). The syntax for functions in R uses those curly brackets: everything inside of them is the function, and the variable inside the function (in this case "x") is the thing that the function will run on. We're going to call our function "text.prep."
text.prep<-function(x){
#first scan in the document
work<-scan(x, what="character", quote="", quiet=T)
#remove numbers
work<-gsub("\\d", "", work)
#remove punctuation
work<-gsub("\\W", "", work)
#make all lowercase
work<-tolower(work)
#remove blanks
work<-work[work != ""]
#return the cleaned vector of words
return(work)
}
The way you prepare to use a function is to "run" it like any other command in R (i.e. run all of the lines of the function). It won't actually do anything yet, but it will store the function in memory under the name you gave it. Then, when you use that name, it will run the function on some other variable (confused? whatever).
If you want to see what this function does, you can type:
> work.v<-text.prep(f.names[1])
This runs the function on the first document and stores the output in a new variable.
> work.v[1:10]
This shows you the first ten words of the first document — everything should be lowercase and have no punctuation. Each word is indexed as a separate value (i.e. has numbers in brackets by it). This is a good way to check and see if your function worked. In general, always inspect your data!
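A couple of other quick checks are handy here (these are just base R functions, not part of the script above): length() gives the total number of words (tokens) in the document, and length(unique()) gives the number of distinct words (types), which is exactly the distinction the type-token ratio below depends on.

> length(work.v)
> length(unique(work.v))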
The next step is to create a function that does something to the text you have ingested. We’re going to use type-token ratio. Type-token ratio divides the number of word types by the total number of words. It is very sensitive to length and thus performs best when you use texts of exactly the same length. Since our documents are not the same length, we will take small segments and randomly sample these from each document a bunch of times (in this case 100 times though you could do more). We’ll then take the average value of these samples as our “type-token ratio” for each document and store that in a new table along with the filename.
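To make the arithmetic concrete, here is a toy example with a made-up six-word sentence. Five of the six words are distinct, so its type-token ratio is 5/6, or about 0.83.

> toy<-c("the", "cat", "sat", "on", "the", "mat")
> length(unique(toy))/length(toy)

And here is the sampling function itself: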
ttr<-function(work.v, winsize){
#first locate a random starting point
#take one random number between 1 and the length of the work minus the window size (so the sample cannot run past the end of the text)
beg<-sample(1:(length(work.v)-winsize),1)
#extract that segment as a new vector
test<-work.v[beg:(beg+(winsize-1))]
#calculate TTR
#TTR = unique words / total words (i.e. types divided by tokens)
ttr.sample<-length(unique(test))/length(test)
return(ttr.sample)
}
Let's unpack this function. First, it takes two variables (or "arguments" in the lingo) to run on. The first is the vector of words we created with our first function. The second is the size of the text window we are going to sample. A common value is 500 words, though the right choice depends on how long your texts are. So to run it on the document we saved as "work.v" above, you would type:
> ttr(work.v, 500)
To run it multiple times on the same work, i.e. take multiple samples:
> replicate(100,ttr(work.v, 500))
This replicates the function 100 times. To take the mean value:
> mean(replicate(100,ttr(work.v, 500)))
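One caveat: because ttr() picks its starting point at random, the mean will shift slightly every time you run it. If you want the same numbers each time (while debugging, say), you can set R's random seed first; the seed value here is arbitrary.

> set.seed(1234)
> mean(replicate(100,ttr(work.v, 500)))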
Notice how you can put functions inside of functions in R!
That's an important point because now we are going to combine these two functions, run them on all of the documents in our folder, and then output the results as a separate table. To do so we are going to use a "for loop." Everybody hates these things in R (so much snootiness around programming), but I find that for certain tasks they are just fine. They also make it easy to see what you are doing and how long things are taking because they work one item at a time. This is a terrible idea if you are storing lots of things in memory (it will eventually grind to a halt), but it is great if you want to run something on one document, keep only a small bit of information about it, discard the rest, and move on to the next one.
So first create an empty table for our results:
> ttr.df<-NULL
Then create a loop that is as long as the number of documents you have. Notice how we’re using the same syntax as the function with curly brackets.
for (i in 1:length(f.names)){
#see how fast things are going
print(i)
#ingest and clean each text using our text.prep function
work.v<-text.prep(f.names[i])
#calculate the mean value of a vector of 100 TTR values
ttr.mean<-mean(replicate(100,ttr(work.v, 500)))
#store everything in a table
filename<-f.names[i]
temp.df<-data.frame(filename, ttr.mean)
ttr.df<-rbind(ttr.df, temp.df)
}
For 150 novels, this should take a few minutes to run. The output is a table (ttr.df) with one row per document: the filename in the first column and the mean type-token ratio (ttr.mean) in the second.
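At this point it is also worth glancing at the table and saving a copy to disk so you don't have to re-run the loop later. head() shows the first few rows; the filename in write.csv() is just a suggestion.

> head(ttr.df)
> write.csv(ttr.df, "ttr_results.csv", row.names=F)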

When we get to the section on corpus comparison and hypothesis testing you can start to analyze these results. For example, is there higher vocabulary richness in novels or non-fiction? Do books decrease their vocabulary richness over time? Do women have richer vocabularies than men in their novels? Just a few questions you might start to ask.