Reading in your data using the TM library

Associated code file for this section: 01_HB_PreparingYourData.R

In this section, you will learn how to ingest a folder of text files using the TM library. To prepare, create a folder called “Data” and store in it the sample data we will be using, a folder called “txtlab_Novel150_English.” As with all sections, the code presented here is also available on GitHub in the associated code file named above.

First, you need to install the libraries of functions we’ll be using.

> install.packages("tm")

> install.packages("textstem")

> install.packages("slam")

You only need to do this once in R (not every time you work with this code). Once your libraries are installed, you then load them. You will need to do this every time you open R.

> library("tm")

> library("textstem")

> library("slam")
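If you are not sure whether a library is already installed, you can ask R before installing anything. The following is only a minimal sketch: the helper name `missing_packages` is made up for illustration and is not part of any library.

```r
# Hypothetical helper (not from any package): return the names in `pkgs`
# that are not yet installed on this computer.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tm", "textstem", "slam")

# Prints the libraries you still need to install (empty if you have them all).
missing_packages(needed)

# To install only what is missing, uncomment the next line:
# install.packages(missing_packages(needed))
```

Installing only the missing libraries keeps the "install once, load every time" distinction clear when you rerun the script later.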

Next, set your working directory to where your data is stored. Don’t point at the folder that holds the text files themselves, but at the folder just “above” it. As mentioned above, the folder with my data is called “txtlab_Novel150_English” and it is located in the folder “Data” on my computer. Thus to set my working directory I type:

> setwd("~/Data")

You can also do this using the drop down menu in RStudio under “Session.” This allows you to navigate to the folder where your data is stored.

[Figure: Locate your working directory in RStudio]
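Before reading anything in, it can be worth checking that R is pointed at the right place. Here is a minimal sketch using only base R; the helper name `count_txt_files` is made up for illustration, and it assumes the sample files end in “.txt”.

```r
# Hypothetical helper: count the .txt files inside `folder`.
# Returns NA if the folder cannot be found from the current working directory.
count_txt_files <- function(folder) {
  if (!dir.exists(folder)) {
    return(NA_integer_)
  }
  length(list.files(folder, pattern = "\\.txt$", ignore.case = TRUE))
}

# Where is R currently looking?
getwd()

# How many text files does R see in the data folder?
count_txt_files("txtlab_Novel150_English")
```

If this prints NA, your working directory is probably not the folder just above the data folder.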

We’re now going to read in your corpus. We will create a variable called “corpus1” and load your collection of text data into that variable.

> corpus1 <- VCorpus(DirSource("txtlab_Novel150_English", encoding = "UTF-8"), readerControl=list(language="English"))
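A quick way to confirm that the files were read is to count the documents in the corpus. The sketch below builds a tiny throwaway corpus in a temporary folder so that it runs on any computer with the tm library installed; the folder name, file names, and sentences are made up for illustration. With the real data you would call length(corpus1) instead.

```r
library(tm)

# Build a throwaway folder holding two small text files (made-up names).
demo_dir <- file.path(tempdir(), "demo_corpus")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("It is a truth universally acknowledged.", file.path(demo_dir, "a.txt"))
writeLines("Call me Ishmael.", file.path(demo_dir, "b.txt"))

# Read the folder the same way as the real corpus.
demo_corpus <- VCorpus(DirSource(demo_dir, encoding = "UTF-8"),
                       readerControl = list(language = "English"))

# Number of documents that were read in.
length(demo_corpus)

# The id of a document is the name of the file it came from.
meta(demo_corpus[[1]], "id")
```

If the count is lower than the number of files in your folder, check the folder name and your working directory.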

To see what is inside your variable called “corpus1,” you can “inspect” it. To see information about the 26th document in your corpus type:

> inspect(corpus1[26])

Notice how we use brackets in R to find certain indexed values, in this case the 26th document. Parentheses are used to tell a function which variable to work on (in this case the variable “corpus1”). Typing the above will show the following information about your 26th document:

<<VCorpus>>

Metadata: corpus specific: 0, document level (indexed): 0

Content: documents: 1

[[1]]

<<PlainTextDocument>>

Metadata: 7

Content: chars: 680732

That’s a lot of junk but the key piece of information is that this document has 680732 characters. You can also look at the text itself to see how well it was ingested.

> strwrap(corpus1[[26]])[1:5]

Here I’m printing out the first five lines [1:5] of my 26th document [[26]], which turns out to be a novel you’ve probably heard of:

[1] ""

[2] "CHAPTER I."

[3] ""

[4] "It is a truth universally acknowledged, that a single man in"

[5] "possession of a good fortune, must be in want of a wife."
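The single brackets, double brackets, and [1:5] used above can be tried out on an ordinary R list, without loading any library. This is only a sketch: the list “novels” and its contents are made up for illustration and stand in for a corpus.

```r
# A plain list standing in for a corpus (made-up contents).
novels <- list(
  first  = "Call me Ishmael.",
  second = paste("It is a truth universally acknowledged, that a single man",
                 "in possession of a good fortune, must be in want of a wife.")
)

one_item_list <- novels[2]    # single brackets: still a list, of length 1
the_text      <- novels[[2]]  # double brackets: the element itself

class(one_item_list)  # "list"
class(the_text)       # "character"

# strwrap() breaks the long string into lines shorter than `width` characters;
# [1:2] then keeps only the first two of those lines.
wrapped <- strwrap(the_text, width = 60)
wrapped[1:2]
```

This mirrors what happens with the corpus: corpus1[26] is still a corpus holding one document, while corpus1[[26]] is the document itself.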

Ok, you did it!
