When you begin topic modeling, the first thing to consider is the length of your documents. Topic modeling estimates the likelihood that words co-occur within the same “space,” where space is defined by the boundaries of a given document. If your documents are short to medium length and topically focused, like newspaper articles or academic articles, then using the whole document is a fine choice. If your documents are long and topically diverse, like novels, then it is not: the documents are so long that almost every word appears in the context of almost every other word (this is an overstatement, but you hopefully get the idea). To handle this problem, you first need to shrink your documents down to smaller pieces, called “chunks.”
Since we are working with the novel data as our sample, you are going to begin by breaking the novels down into these smaller units. To do so, you will read in each novel one by one, reusing the ingestion code from before. This is the text ingestion and cleaning function:
text.prep <- function(x){
  #first scan in the document, splitting on whitespace
  work <- scan(x, what="character", quote="", quiet=TRUE)
  #remove numbers
  work <- gsub("\\d", "", work)
  #remove punctuation
  work <- gsub("\\W", "", work)
  #make all lowercase
  work <- tolower(work)
  #remove blanks
  work <- work[work != ""]
  #return the cleaned vector of words
  return(work)
}
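The chunking loop below relies on f.names, the vector of novel filenames created earlier when you read the novels in one by one. If you need to rebuild it, a minimal sketch (assuming the same novels directory used in the loop below) looks like this, along with a quick test of text.prep on a single novel:
setwd("~/Data/txtlab_Novel150_English")
#list all of the novel filenames
f.names <- dir(pattern="\\.txt$")
#sanity check: clean the first novel and inspect its first ten words
work.v <- text.prep(f.names[1])
work.v[1:10]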
Remember, this is a function: running the code does not process anything yet, it simply stores the function in your workspace so you can call it later. The next step is to write a loop that goes through each novel, divides it into chunks of the same size, and writes those chunks as separate files to a new directory.
Set the chunk size. Here we use 1,000 words. This is a good default, but 500 also works well. As you will see, there are no hard and fast rules for topic modeling, which is of course one of the problems of using it as an explanatory mechanism. Make sure to document all of your choices and, as much as possible, test different options. This can help show how robust your insights are and how dependent they are on small tweaks to the parameters of the model.
chunk <- 1000
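One practical note: the loop below writes its output into a separate chunk folder but does not create that folder for you. If it does not already exist, you can create it from within R (the path here simply mirrors the one used in the loop):
dir.create("~/Data/txtlab_Novel150_English_Chunks_1000")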
Create a loop that goes through each original file:
for (i in 1:length(f.names)){
  #set your working directory inside the novels folder
  setwd("~/Data/txtlab_Novel150_English")
  #print the counter to see how fast things are going
  print(i)
  #ingest and clean each text using the text.prep function
  work.v <- text.prep(f.names[i])
  #set your working directory to a new folder for the chunks
  setwd("~/Data/txtlab_Novel150_English_Chunks_1000")
  #set file number integer
  n <- 0
  #go through the entire novel and divide it into equal-sized chunks
  #(leftover words at the end that do not fill a full chunk are dropped)
  for (j in seq(from=1, to=length(work.v)-chunk, by=chunk)){
    n <- n + 1
    sub <- work.v[j:(j+(chunk-1))]
    #collapse the chunk into a single string of words
    sub <- paste(sub, collapse=" ")
    #write to the chunk directory using a custom file name
    new.name <- gsub("\\.txt", "", f.names[i])
    new.name <- paste(new.name, sprintf("%03d", n), sep="_")
    new.name <- paste(new.name, ".txt", sep="")
    write(sub, file=new.name)
  }
}
If you used 1,000-word chunks, then you should have 18,407 new files.
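You can confirm the count directly from R by switching into the chunk directory and counting its files:
setwd("~/Data/txtlab_Novel150_English_Chunks_1000")
length(dir())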