Preparing your data

Now we can use the tm library to create a document-term matrix again. Instead of 150 novels, we now have 18,407 parts of novels (i.e. rows).

Set your working directory:

setwd("~/Data")

Ingest the corpus:

corpus1 <- VCorpus(DirSource("txtlab_Novel150_English_Chunks_1000", encoding = "UTF-8"), readerControl=list(language="English"))

No cleaning is necessary, as we already did that in our function above.

Create a DTM:

corpus1.dtm<-DocumentTermMatrix(corpus1, control=list(wordLengths=c(1,Inf)))

We now run into the problem of dimensionality reduction again. Wait, I thought you said topic modeling helped reduce dimensions! Well, before we do that we have to reduce our dimensions. I know. The reason is Zipf’s law. Some words appear so often that they drown out the other words that are likely to be more interesting. What words? Stopwords, but also proper names. Especially proper names. If you work on novels or other kinds of historical documents with lots of proper names, then your model is going to produce topics with lots of proper names. This may be exactly what you want but in most cases it isn’t.
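To make the skew concrete, here is a minimal sketch in base R. The sentence is made up for illustration and is not from the corpus; the variable names are hypothetical too. It simply counts words and shows how much of the text the two most frequent ones take up.

```r
# Toy text, invented for illustration; not taken from the novel corpus.
toy.text <- "the captain and the whale and the sea and the ship and ahab"

# Split on spaces to get one token per element.
toy.words <- strsplit(toy.text, " ")[[1]]

# Count each word and sort from most to least frequent.
toy.counts <- sort(table(toy.words), decreasing = TRUE)

# Share of all tokens taken up by the two most frequent words.
top.share <- sum(toy.counts[1:2]) / sum(toy.counts)

print(toy.counts)
print(top.share)
```

Even in this tiny example, two function words account for well over half of the tokens. On the real DTM you could look at something like sort(slam::col_sums(corpus1.dtm), decreasing = TRUE)[1:20], assuming the slam package (which tm depends on) is available.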

Removing proper names is actually incredibly challenging. What’s a proper name? Anything that an author uses to refer to a character. Creating an exhaustive list is, well, exhausting. It often involves trial and error. And then you discover things like “sir” and “madame” or “doctor” and keep adding…

The other approach is to use a custom dictionary that contains words common to your domain but lacks proper names and stopwords, and to keep only those words. The way I built this dictionary was to undertake the steps in the previous section, remove sparse words and stopwords, and then review the list to try to ensure there aren’t any proper names in it. (If you find some, let me know!) To do this:

setwd("~/Data/Dictionaries")

keep<-read.csv("Dict_English_NovelWords_3000_NoStop.csv", header=FALSE, stringsAsFactors = FALSE)

Subset your dtm by the dictionary:

dtm<-corpus1.dtm[,which(colnames(corpus1.dtm) %in% keep$V1)]

Caveat: after applying your dictionary, some documents may end up with 0 words. This can happen with really short documents or a very small dictionary. If that is the case, see the accompanying code. Otherwise go to the next line.
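If you want a quick way to check for empty documents, here is one possible sketch. It is not necessarily what the accompanying code does. It builds a three-document toy corpus (invented text, hypothetical variable names), subsets it by a tiny keep list, and drops any rows whose word count falls to zero. It assumes the tm and slam packages are installed.

```r
library(tm)

# Toy corpus, invented for illustration; the second document contains only a proper name.
toy.docs <- VCorpus(VectorSource(c("the whale swam", "ahab ahab", "sea and whale")))
toy.dtm <- DocumentTermMatrix(toy.docs, control=list(wordLengths=c(1,Inf)))

# Hypothetical two-word dictionary standing in for keep$V1.
keep.words <- c("whale", "sea")
toy.sub <- toy.dtm[, colnames(toy.dtm) %in% keep.words]

# Total remaining words per document.
row.totals <- slam::row_sums(toy.sub)

# How many documents went to 0 words?
print(sum(row.totals == 0))

# Keep only documents that still have at least one word.
toy.sub <- toy.sub[row.totals > 0, ]
print(dim(toy.sub))
```

On the real data, the same two steps would be applied to dtm in place of toy.sub.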

Change the dtm to a matrix:

corpus2<-as.matrix(dtm)
