When you begin topic modeling, the first thing to consider is the length of your documents. Topic modeling estimates the likelihood that words co-occur within the same “space,” where space is defined by the boundaries of a given document. If your documents are short to medium length and topically focused, like newspaper articles or academic articles, then using the whole document is a fine choice. If your documents are long and topically diverse, like novels, then it is not: the documents are so long that almost every word appears in the context of almost every other word (this is an overstatement, but you hopefully get the idea). To handle this problem, you first need to shrink your documents down to smaller pieces, called “chunks.”
Since we are working with the novel data as our sample, you are going to begin by breaking the novels down into these smaller units. To do so, you will read in each novel one by one, reusing the ingestion code from before. This is the text ingestion and cleaning function:
text.prep <- function(x){
  #first scan in the document, splitting on whitespace
  work <- scan(x, what="character", quote="", quiet=TRUE)
  #remove numbers
  work <- gsub("\\d", "", work)
  #remove punctuation
  work <- gsub("\\W", "", work)
  #make all lowercase
  work <- tolower(work)
  #remove blanks
  work <- work[work != ""]
  #return the cleaned vector of words
  return(work)
}
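The chunking loop below relies on f.names, the vector of novel filenames created earlier when you read the novels in one by one. If you need to rebuild it, a minimal sketch (assuming the same novels directory used in the loop below) looks like this, along with a quick test of text.prep on a single novel:
setwd("~/Data/txtlab_Novel150_English")
#list all of the novel filenames
f.names <- dir(pattern="\\.txt$")
#sanity check: clean the first novel and inspect its first ten words
work.v <- text.prep(f.names[1])
work.v[1:10]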
Remember, this is a function: running the code does not process anything yet, it simply stores the function in your workspace so you can call it later. The next step is to write a loop that goes through each novel, divides it into chunks of the same size, and writes those chunks as separate files to a new directory.
Set the chunk size. Here we use 1,000 words. This is a good default, but 500 also works well. As you will see, there are no hard and fast rules for topic modeling, which is of course one of the problems of using it as an explanatory mechanism. Make sure to document all of your choices and, as much as possible, test different options. This can help show how robust your insights are and how dependent they are on small tweaks to the parameters of the model.
chunk <- 1000
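One practical note: the loop below writes its output into a separate chunk folder but does not create that folder for you. If it does not already exist, you can create it from within R (the path here simply mirrors the one used in the loop):
dir.create("~/Data/txtlab_Novel150_English_Chunks_1000")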
Create a loop that goes through each original file:
for (i in 1:length(f.names)){
  #set your working directory inside the novels folder
  setwd("~/Data/txtlab_Novel150_English")
  #print the counter to see how fast things are going
  print(i)
  #ingest and clean each text using the text.prep function
  work.v <- text.prep(f.names[i])
  #set your working directory to a new folder for the chunks
  setwd("~/Data/txtlab_Novel150_English_Chunks_1000")
  #set file number integer
  n <- 0
  #go through the entire novel and divide it into equal-sized chunks
  #(leftover words at the end that do not fill a full chunk are dropped)
  for (j in seq(from=1, to=length(work.v)-chunk, by=chunk)){
    n <- n + 1
    sub <- work.v[j:(j+(chunk-1))]
    #collapse the chunk into a single string of words
    sub <- paste(sub, collapse=" ")
    #write to the chunk directory using a custom file name
    new.name <- gsub("\\.txt", "", f.names[i])
    new.name <- paste(new.name, sprintf("%03d", n), sep="_")
    new.name <- paste(new.name, ".txt", sep="")
    write(sub, file=new.name)
  }
}
If you used 1,000-word chunks, then you should have 18,407 new files.
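You can confirm the count directly from R by switching into the chunk directory and counting its files:
setwd("~/Data/txtlab_Novel150_English_Chunks_1000")
length(dir())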