
Chunking Texts

When you begin topic modeling, the first thing to consider is the length of your documents. Topic modeling looks at the likelihood of co-occurrence of words within the same "space," where the space is defined by the boundaries of a given document. If your documents are short to medium length and topically focused, like newspaper or academic articles, then using the whole document is a fine choice. If your documents are long and topically diverse, like novels, it is not: the documents are so long that almost every word appears in the context of almost every other word (an overstatement, but you get the idea). To handle this problem, you first need to shrink your documents down to smaller pieces, called "chunks."

Since we are working with the novels as our sample data, you are going to first break them down into smaller units. You will read in each novel one by one, so we will reuse the ingestion and cleaning code from earlier, wrapped up as a function:


  text.prep<-function(x){

    #first scan in the document as a vector of words

    work<-scan(x, what="character", quote="", quiet=T)

    #remove numbers

    work<-gsub("\\d", "", work)

    #remove punctuation

    work<-gsub("\\W", "", work)

    #make all lowercase

    work<-tolower(work)

    #remove blanks

    work<-work[work != ""]

    #return the cleaned word vector

    return(work)
  }


Remember, this is a function definition: you have to run it once so the function is stored in your workspace (to be called later). The next step is to write a loop that goes through each novel, divides it into chunks of the same size, and writes those chunks as separate files to a new directory.

Set the chunk size. Here we use 1,000 words. This is a good default, but 500 also works well. As you'll see, there are no hard and fast rules for topic modeling, which is of course one of the problems with using it as an explanatory mechanism. Make sure to document all of your choices and, as much as possible, test different options. This can help you judge how robust your insights are and how dependent they are on small tweaks to the model's parameters.
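In R, that choice is just a single variable, which the loop below refers to as chunk (the name is ours; call it whatever you like, as long as the loop matches):

  #set the chunk size in words; try 500 as an alternative
  chunk<-1000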


Create a loop that goes through each original file. (This sketch assumes the novel file names are stored in a vector f.names, and that input.dir and output.dir hold the paths to the folder of novels and to a new folder for the chunks; chunk.name and n are names we introduce for the per-chunk file names.)

for (i in 1:length(f.names)){

  #set your working directory inside of your novels

  setwd(input.dir)

  #see how fast things are going

  print(f.names[i])

  #ingest and clean each text using the text.prep function

  work.v<-text.prep(f.names[i])

  #set your working directory to a new folder for the chunks

  setwd(output.dir)

  #set file number integer

  n<-1

  #go through entire novel and divide into equal sized chunks

  for (j in seq(from=1, to=length(work.v)-chunk, by=chunk)){

            #extract the next chunk-sized run of words

            sub<-work.v[j:(j+chunk-1)]

            #collapse into a single paragraph

            sub<-paste(sub, collapse = " ")

            #write to a separate directory using a custom file name

            chunk.name<-gsub("\\.txt$", "", f.names[i])

            chunk.name<-paste(chunk.name, sprintf("%03d", n), sep="_")

            chunk.name<-paste(chunk.name, ".txt", sep="")

            write(sub, chunk.name)

            n<-n+1
  }
}
If you used 1,000-word chunks, then you should have 18,407 new files.
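A quick way to check the count is to list the chunk files in the output folder (assuming, as above, that output.dir holds the path to that folder):

  #count the chunk files written to the output directory
  length(list.files(output.dir, pattern="\\.txt$"))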
