
Normalizing your data 1: Textual normalization

Associated code file for this section: 01_HB_PreparingYourData.R

The next step after you’ve read in your text files is to clean them up; the preferred term is normalization. There is plenty to be written about the hygienic and normality biases of quantitative thinking. For now, think of this as a form of conditionality: my observations could be accurate under the following conditions (which will not mirror all possible states in the world).

You’re going to do things to your texts to remove differences that you feel are not important to your research question, i.e. you are going to homogenize them in some way.


How do you know when a difference doesn’t make a difference to your question (to paraphrase Gregory Bateson)? That, friends, is a very hard question to answer. Much of it goes back to your research goals and also to the important aim of removing confounding variables. For example, imagine some result you found was due to having left in capital letters, i.e. capital letters explained something about the world. This is either really interesting or totally trivial. If the latter, then you want to remove them; if the former, then keep them in! (I told you it was a hard question to answer.)

The most important point is that you reflect on these steps and also document what you did. Increasingly, the emphasis in quantitative work is on documenting all decision points in the research process, because these can lead to high levels of variability in findings and interpretation. It could be the case that capitalization or punctuation is very important to your research question. For example, if you are trying to detect spam, capitalization can be very useful. If, however, you think “Today,” “today,” and “today.” all mean the same thing for your purposes, then you will want to “normalize” the spelling of your words by lowercasing them and removing punctuation. For a computer, each of these versions represents a unique “type,” which we probably (mostly, usually, whatever) want to represent as the same type. So here are the four recommended steps of textual normalisation (and as a good Canadian-American I mix the two spellings, which might be something you want to normal-eyes too).
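Before running the tm steps below, here is a quick base-R sketch of the “type” problem (the three strings are invented for illustration):

```r
# To a computer these three strings are three distinct "types"
words <- c("Today", "today.", "today")
length(unique(words))          # 3

# After lowercasing and removing punctuation they collapse into one type
normalized <- gsub("[[:punct:]]", "", tolower(words))
length(unique(normalized))     # 1
```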

#strip white space

> corpus1 <- tm_map(corpus1, content_transformer(stripWhitespace))

#make all lowercase

> corpus1 <- tm_map(corpus1, content_transformer(tolower))

#remove numbers

> corpus1 <- tm_map(corpus1, content_transformer(removeNumbers))

#remove punctuation

> corpus1 <- tm_map(corpus1, content_transformer(removePunctuation))
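If you want to see what each of these four steps does without building a whole corpus, here is a rough base-R equivalent applied to a made-up string (gsub and friends stand in for tm’s transformations here; this is a sketch of the effect, not what tm does internally):

```r
x <- "  It is a TRUTH   universally acknowledged, in 1813!"
x <- gsub("\\s+", " ", trimws(x))  # strip white space
x <- tolower(x)                    # make all lowercase
x <- gsub("[0-9]+", "", x)         # remove numbers
x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
x  # note the stray trailing space left where "1813!" used to be
```

One thing this makes visible: removing numbers and punctuation can leave stray spaces behind, which is one reason you might want to strip white space again at the very end.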

Ok, so far so good (more or less). Let’s now reinspect those opening lines of Jane Austen and see what happened.

> strwrap(corpus1[[26]])[1:5]

[1] ""

[2] "chapter i"

[3] ""

[4] "it is a truth universally acknowledged that a single man in possession"

[5] "of a good fortune must be in want of a wife"

Yeah! It worked. All lowercase, no punctuation, and no numbers (Arabic at least, not Roman…).

There is one final option here, which is to either stem or lemmatize your words. This transforms words into even more elementary units: for example, the lemma of “is” is “be” and the lemma of “has” is “have.” “Running,” “ran,” and “run” all become “run” after lemmatization. To do so:

#lemmatize_strings comes from the textstem package

> library(textstem)

> corpus1.lemma <- tm_map(corpus1, lemmatize_strings)

> corpus1.lemma <- tm_map(corpus1.lemma, PlainTextDocument)

For example, when we lemmatize Jane Austen’s opening we see:

> strwrap(corpus1.lemma[[26]])[1:5]

[1] ""

[2] "chapter i"

[3] ""

[4] "it be a truth universally acknowledge that a single man in possession"

[5] "of a good fortune must be in want of a wife"

Notice how “is” is transformed into “be” and “acknowledged” has lost its past tense and is now the root form. These all make sense and will allow you a more general representation of the types of words in your document. In practice, though, there can be a lot of error associated with this step and the jury is still out on whether it improves things for different kinds of tasks.
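If you want to try the cruder cousin, stemming, tm ships a Porter stemmer via stemDocument (it relies on the SnowballC package being installed). As with lemmatization above, give the result a new name; corpus1 here is the corpus built earlier in this section:

```r
library(tm)
library(SnowballC)  # supplies the Porter stemmer that stemDocument uses

# stem every document in the corpus (corpus1 is from earlier in this section)
corpus1.stem <- tm_map(corpus1, stemDocument)

# Stemming chops off suffixes rather than looking words up in a dictionary:
wordStem(c("running", "ran", "acknowledged"))
# -> "run" "ran" "acknowledg"
```

Note how “ran” is left untouched (no suffix to chop) and “acknowledg” is not a dictionary word; that is the trade-off against lemmatization.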

For now we are not going to be using it. But it’s good to know about! (Also, if you do use it, give your corpus a new name like I did so you can decide which type of representation to use and remember which one you are using!).
