All of the data and code for everything that follows in this section are available at my GitHub repository.
So far you have learned how to think in models. In this section I want to begin introducing you to the ways you might translate your ideas into measurements using computational text analysis. To do that you will need to learn how to make a text “machine-readable,” i.e., the steps involved in turning a text that once made (some) sense to humans into one that makes (some) sense to machines. And then you will need to learn how to construct different kinds of measurements, i.e., how to work with the quantities that texts are composed of.
I will be using a sample data set called txtlab_Novel150_English, aptly named because I put it together and it contains…150 novels…in English. The texts are plain-text files with UTF-8 encoding. This is the gold standard to strive for when you build a corpus: a folder with a bunch of .txt files (now you get the name of the lab!). We will also be using a table of metadata with the same name but with the file extension .csv. Metadata is essentially a bunch of information about your data, and the more of it you have the better. The format I am using for all metadata and tables will be .csv (for comma-separated values). I am not going to talk about collecting your data (i.e., “scraping”) or transforming your data into .txt files (OCR, PDF conversion, etc.). That’s a whole different can of worms (and maybe the subject of future installments). But for now, let’s assume you have .txt files and a .csv metadata file.
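To make that setup concrete, here is a minimal sketch in base R of reading a folder of UTF-8 .txt files alongside a .csv metadata table. It is not the code from the section's .R file. So that it runs anywhere, it first builds a tiny stand-in corpus in a temporary folder. The folder name, file names, and metadata columns (`filename`, `author`) are all assumptions made up for illustration, and the real metadata table may be organized differently.

```r
# --- Build a tiny stand-in corpus (hypothetical names throughout) ---
corpus_dir <- file.path(tempdir(), "toy_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("This is the first toy novel.", file.path(corpus_dir, "novel_a.txt"))
writeLines("This is the second toy novel.", file.path(corpus_dir, "novel_b.txt"))

# A metadata table with one row per text file
metadata_path <- file.path(tempdir(), "toy_corpus.csv")
write.csv(
  data.frame(
    filename = c("novel_a.txt", "novel_b.txt"),
    author = c("Author A", "Author B"),
    stringsAsFactors = FALSE
  ),
  metadata_path,
  row.names = FALSE
)

# --- Read the corpus back in ---
# Find every .txt file in the folder
txt_files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file as UTF-8 and collapse its lines into a single string
texts <- vapply(
  txt_files,
  function(f) paste(readLines(f, encoding = "UTF-8", warn = FALSE), collapse = "\n"),
  character(1)
)
names(texts) <- basename(txt_files)

# Read the metadata table
metadata <- read.csv(metadata_path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")

# Sanity check: every text should have a metadata row, and vice versa
missing_meta <- setdiff(names(texts), metadata$filename)
missing_text <- setdiff(metadata$filename, names(texts))
```

With your own corpus you would point `corpus_dir` and `metadata_path` at your folder and your table, and swap `filename` for whatever column links a metadata row to its text. If either `missing_meta` or `missing_text` is non-empty, the texts and the table are out of sync.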
Key point number 2. All of the code provided here will be in R version 3.6.1 (aptly named “Action of the Toes”). A Python installment will hopefully one day follow. All of the screenshots will refer to RStudio 1.2.5. Download both and install. All of the code is contained in associated .R files. For example, this section’s R file is called “01_Fish_PreparingYourData.R.”
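Once both are installed, a quick way to confirm which R you are running is the short check below. It is only a sketch using base R. It warns rather than stops if your version is older than 3.6.1, since code written for one version often, though not always, runs on another.

```r
# Report the installed R version and its nickname
cat(R.version.string, "\n")
cat("Nickname:", R.version$nickname, "\n")

# Compare the installed version against the 3.6.1 used in the text
is_older <- getRversion() < "3.6.1"
if (is_older) {
  warning("This R is older than 3.6.1; some code may behave differently.")
}
```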
I strongly recommend doing an online tutorial to get comfortable with programming in R before getting started here. I find those kinds of tutorials incredibly boring (if I’m not actually doing something with my knowledge, it goes in one ear and out the other), but working through one will make this section way easier to follow.