Ok, now you are ready to transform your text data into a table of word frequencies. You are transforming the text from one kind of representation, words in order, into another kind, often referred to as a “bag of words” (which is just gross, but whatever). The technical term we will use is a “document-term matrix,” which, when transposed (rows and columns swapped), becomes a “term-document matrix.” This table has the following structure: rows are your documents (also called “observations”) and columns are your words (also called “features” or “variables”). The values in the cells are the number of times a word (feature) occurs in a document (observation). It’s one very simple command:
corpus1.dtm <- DocumentTermMatrix(corpus1, control = list(wordLengths = c(1, Inf)))
Here you see how you are running the function DocumentTermMatrix on the object “corpus1.” We have also added an option called “wordLengths”. The tm package defaults to keeping only words of 3 or more letters in length. If you want “all” words, you need to tell it to keep words from length 1 to infinity, which is what we’ve done here.
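If you want to sanity-check the result, tm has a few helpers for this (a quick sketch; the frequency threshold of 50 is just an arbitrary illustration):

> dim(corpus1.dtm) # number of documents by number of terms
> inspect(corpus1.dtm) # summary statistics plus a sample of the matrix
> findFreqTerms(corpus1.dtm, 50) # terms that occur at least 50 times across the corpus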
Sometimes you might not want to work with just single words but with word pairs (also called “bigrams”) or word sequences of any length (also called “ngrams”). This is more often useful when you are working with part-of-speech analysis, for example, because there are far fewer distinct parts of speech than distinct words, so capturing sequences doesn’t produce hopelessly sparse data. With word ngrams, the frequency at which any given pair of words appears in sequence is extremely low, so unless you have lots and lots of data this won’t tell you much. For most purposes single words, or 1grams, are a fantastic tool. However, if you did want to generate bigrams or ngrams of any length, here’s how:
> # ngrams() and words() come from the NLP package, which is attached along with tm
> BigramTokenizer2 <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
First create a function that says how many words in sequence you want to gather. The “2” above is what you need to change: if you want 3grams, you put (words(x), 3), and so on. If you want to capture both 1grams and bigrams, then you would do the following:
> BigramTokenizer12 <- function(x) unlist(lapply(ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)
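If you want to see what a tokenizer like this produces, you can try running it directly on a single document from your corpus (a quick sketch, assuming corpus1 holds plain text documents as above):

> BigramTokenizer2(corpus1[[1]]) # returns a character vector of word pairs from the first document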
To use this function when making your document-term matrix, you run DocumentTermMatrix as before and pass the tokenizer function inside the control list:
> dtm.bigram <- DocumentTermMatrix(corpus1, control = list(tokenize = BigramTokenizer2, wordLengths = c(1, Inf)))
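The same pattern works for any tokenizer defined this way. For example, to keep 1grams and bigrams together in one matrix (a sketch using the BigramTokenizer12 function from above):

> dtm.uni.bi <- DocumentTermMatrix(corpus1, control = list(tokenize = BigramTokenizer12, wordLengths = c(1, Inf)))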
What this outputs is a table whose columns are all the pairs of words that appear in your documents. It is massive, and extremely sparse, which is another reason to avoid bigrams if you can help it.
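If you do need to work with a matrix like this, one common way to tame it is tm’s removeSparseTerms(), which drops terms that are missing from almost all documents (a sketch; the 0.99 cutoff is just an illustration, not a recommendation):

> dtm.bigram.small <- removeSparseTerms(dtm.bigram, 0.99) # keep only terms appearing in at least roughly 1% of documents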