
Working with Metadata

Associated code file = 01_HB_WorkingWithTables.R

As we have learned from the political landscape, metadata (or data about data) tells us a great deal about ourselves. We don’t actually need to know the content of our telephone calls; just knowing the people we are calling is enough to make inferences about our future actions.

The same is true of text data. In some cases, the data about the documents may be as interesting to study as the documents themselves, if not more so. One reason, as we’ll see below, is that documents contain so much varied information that it can be hard to make concrete judgments about them. Metadata simplifies our representation of our documents and, in the best-case scenario, reduces them to the salient aspects of their meaningfulness.

In this section I am going to concentrate on working with tables because in many cases that is the most rudimentary form of data you may have and, again in many cases, it may be all you need. Before we get to the complexities of extracting text data from texts, we can start by getting comfortable with tables and manipulating data in R.

One of the things you will need to get comfortable with at the start in R is the nomenclature surrounding so-called “tables.” There are, alas, many kinds of tables in R, but this is actually a good thing. It makes life easier, as you’ll see. So how many? (A short code sketch follows the list.)

– vector: this is not a table at all, but a single “column” of a table. Put multiple vectors side by side and you get…a table! So a vector is a one-dimensional object. If you’re coming from Excel, it’s like a column.

– list: a very fancy kind of vector, which is a vector of vectors (what?!?). Each item of a list is a vector. It’s a convenient and efficient way to store information, but lists are the hardest type to work with, so we will largely be avoiding them. When in doubt, go without.

– table: this is not a table at all (gotcha!), but a function in R. When you “table” a vector, it produces a…wait for it…special table object: a set of named counts. It consists of the unique categories of your vector and how many times each occurs (i.e. they have been added up, or “tabled”). For example, say you have a vector of author names. By tabling them, you get a count of how many documents you have by each author. It adds up the units of a vector. We’ll see examples in action below.

– data frame: the closest thing to a table in Excel and the bread and butter of R because they can consist of multiple types of data. A data frame can have one column with numbers, one with strings, one with factors, etc. Data frames are amazing.

– matrix: like a data frame but can only consist of numbers. Matrices are good for doing math with, which we will occasionally be doing. So when you transform something from a data.frame into a matrix it is usually because you want to transform the values in the cells by some amount.
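To make these types concrete, here is a minimal sketch you can paste into the console (the toy values are invented for illustration):

> authors<-c("Austen", "Dickens", "Austen") # a vector of strings

> dates<-c(1813, 1837, 1815) # a vector of numbers

> l<-list(authors, dates) # a list: a vector of vectors

> table(authors) # named counts of each unique value: Austen 2, Dickens 1

> df<-data.frame(authors, dates) # vectors side by side = a data frame

> m<-matrix(1:6, nrow=2) # a matrix: numbers only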

Some other terms you’ll need for this section regarding types of data inside of data frames:

– strings: words and numbers treated as alphanumeric “characters.” Characters are the units of strings.  

– factors: technically these are variables that take a limited number of values called “levels”. We’ll see examples below, but one example would be “genre”: if genre is represented as a factor (i.e. it is what you’ve called a column of your table), then the different “levels” would be the types of genre you have (mysteries, romances, bestsellers, etc.). Factors are useful for assessing relationships between types of things.

– integers and numerics: R’s terms for numbers. If you have a column of numeric data it will be represented either as “integer” (whole numbers) or “numeric” (numbers that can have decimal points).

The beauty of these categories is you can change things from one type to the other depending on your goals. Sometimes you want your columns to be strings, sometimes factors, sometimes numbers, etc.
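For instance, building on the toy vectors from the sketch above, here is a quick look at moving between types:

> as.factor(authors) # strings become a factor with levels Austen, Dickens

> as.character(as.factor(authors)) # and back to strings

> as.numeric("1813") # a string coerced into a number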

In what follows you’ll get comfortable working with and manipulating tables (i.e. vectors, data frames and matrices). I haven’t included everything you can do with/to a table (ok, data frame from here on out) but I have tried to show you the most common operations.

As an example we will be using the metadata to the Novel150 data set called “txtlab_Novel150_English.csv.”

Start by setting your working directory

> setwd("")

Load your table. You should see a data frame with 9 columns. (Note: since R 4.0, read.csv() no longer converts strings to factors by default, so we set stringsAsFactors = TRUE to get the behavior this section assumes.)

> a<-read.csv("txtlab_Novel150_English.csv", stringsAsFactors = TRUE)

What are those columns?

> colnames(a)

The metadata we have thus consists of some vanilla things like filenames and IDs, but also more interesting things like author gender and word count. Metadata can be anything: as you think about a research project, think about the different ways you might want to categorize your documents. This is metadata. The more you have, the more multi-dimensional an understanding you will have of your data.
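A quick way to see every column and the type of data it holds at once is str() (for “structure”):

> str(a)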

Categorical Data

There are two kinds of data stored in our data frame: numerical data and categorical data. Numerical data refers, for example, to the dates of publication and word counts. Categorical data refers to the author’s gender or the novel’s point of view. R will refer to categorical data as “factors” that contain different “levels.” For example, “gender” is a factor that contains two levels in our data (and may contain more in your data).

Whenever you want to view a column of a data frame (i.e. a particular vector) you call it using the $ sign:

> a$filename

Notice how it will list the filenames but also say “150 levels:…” Because we told read.csv() to convert strings to factors (which R did automatically before version 4.0), R has turned the filenames into 150 separate levels of a single variable. What does that mean? Strings = strings of letters or numbers (i.e. words or numbers), while factors = categories which can have multiple levels. Sometimes this is useful, sometimes not. You can tell if a value is a factor or a string by whether it prints with quotes around it. Notice how the filenames don’t have quotes. That’s how you know they are factors.
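You can also ask R directly what type a column is:

> class(a$filename)

Here this should return “factor” (if it returns “character” instead, your strings were not converted; see the read.csv() note above).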

If you want to coerce a factor to a string, you can use as.character():

> as.character(a$filename[1]) 

OR for the whole thing:

> as.character(a$filename)

Factors are useful if you have a variable of interest with multiple levels. One example is authors. You may have multiple authors in your data set, and you should have a general idea of how much author repetition you have. If you want to generalize about larger social practices, you don’t want too many books by the same author.

To observe this you use the following two functions. This tells us how many “levels” there are in the “author” factor:

> nlevels(a$author)

This is akin to asking how many unique authors there are:

> length(unique(a$author))

To observe the levels themselves (i.e. the unique author names):

> levels(a$author)

To find out who has the most books in your data set, you can use the table function and sort it.

> sort(table(a$author), decreasing = F)

Notice how we can put functions inside of functions. First I table my data and then I sort it. I don’t need to do this with two separate lines of code. Yay, efficiency!

Notice also how I have sorted it from lowest to highest to see it better in R Studio. You can change decreasing = T to sort the other way. Very easy.

Let’s look at the column called “gender”.

> levels(a$gender)

Similarly we can use the table function to see what the ratio is:

> table(a$gender)
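table() gives raw counts; if you would rather see the ratio as proportions, you can wrap it in prop.table():

> prop.table(table(a$gender))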

SAMPLE QUESTION: are women more likely to be associated with first or third person novels or neither?

While we’re getting a bit ahead of ourselves we could create what is known as a contingency table that records the following information:

            Female   Male
      1P      x        y
      3P      z        w

Here we are asking what is the ratio of women writers in the first person to women writers in the third person *relative* to men in the same categories. If they are equally distributed we should see no statistically significant difference.

First let’s build a data frame with this information extracted from our metadata.

To do this you subset the data frame “a” by the factor gender equalling women AND the factor point of view equalling first person novels.

> x<-nrow(a[a$gender == "female" & a$person == "first",])

NOTE: brackets are for subsetting something. Because this is a data frame, and thus has two dimensions, I use a comma after my conditions to indicate: keep only those rows that match these conditions. If I instead put conditions after the comma, I am asking to keep only those columns that match. I then run the function nrow() because I want to know how many rows in my metadata table meet these two conditions.
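Here is a small sketch of the rows-versus-columns logic, using columns we know exist in this metadata:

> a[a$gender == "female",] # keep only rows where gender is female

> a[,c("author", "date")] # keep only the author and date columns

> a[a$gender == "female", c("author", "date")] # both at once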

Next, how many women wrote third person novels (i.e. NOT first person novels)?

> z<-nrow(a[a$gender == "female" & a$person == "third",])

How many men wrote first person novels?

> y<-nrow(a[a$gender == "male" & a$person == "first",])

How many men wrote third person novels?

> w<-nrow(a[a$gender == "male" & a$person == "third",])

Now construct a data frame by combining these four values:

> cont<-data.frame(c(x,z), c(y,w))

Inspect it to make sure it is correct and label it to avoid confusion

> colnames(cont)<-c("female", "male")

> row.names(cont)<-c("first", "third")
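As an aside, you can build the same contingency table in a single step by handing table() two vectors at once (just check that the rows and columns line up the same way as our hand-built version):

> table(a$person, a$gender)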

The results look skewed, but would they pass a statistical test of some sort?

To find out we’ll use Fisher’s exact test:

> fisher.test(cont)

            Fisher's Exact Test for Count Data

data:  cont
p-value = 0.1036
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.233870 1.156767
sample estimates:
odds ratio 
 0.5280356

The outputs are interesting here. First the value under odds ratio tells us that women are about half as likely to write novels in the first person in our data set. But then it also tells us that we ought not to put too much emphasis on this difference given the small sample size and relative closeness of the values. How do we know this?

We see that the p-value is estimated at 0.1036. This does not fall below the conventional (and arbitrary) threshold of 0.05: if there were truly no difference between how often women and men write first-person novels, we would still expect to see a skew this large more than 10% of the time just by chance. Nevertheless, 0.10 is close to 0.05, meaning it is still a relatively low value (this is why it is important not to think of 0.05 as a hard cut-off). In other words, we have our human judgment, which says women do it about half as much (which feels meaningful), and our statistical judgment, which says this could just be due to random chance (which feels meaningful in the opposite direction). We would want to report both aspects of our findings. And then go get more data.

What else can we learn from tables?

Numerical Data

Our data also contains two columns of numerical data. How do we handle that?

First, you can take the mean:

> mean(a$date)

1862 is the mean publication date in this data. That’s very useful to know. It tells us where the centre of gravity is in our data. We can get fancier with summary:

> summary(a$date)

Now we know the earliest date associated with a novel and the latest. This is useful to assess the historical boundaries of your data. If your data stops at 1930 you cannot talk about the “20C novel” (not that that’s ever stopped anyone…)

We can also assess how well the mean and median line up: the further apart they are, the more skew there is in the data. Here they are almost the same.
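You can pull the two numbers out directly to compare them:

> mean(a$date)

> median(a$date)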

A third way you can handle numerical data is to assess the overall distribution of values. Are some periods covered better than others?

First you can use a histogram. 

> hist(a$date)

The way to read a histogram is that the x-axis tells us the range of values of the variable you are looking at (in this case date of publication) from lowest to highest. Thus we go from a period slightly before 1800 to slightly after 1900. The y-axis tells us how many novels are inside of each bin. Thus we see that there are 30 novels published in what appears to be around the 1890s (not knowing exactly the range of the bin is a problem we’ll address next).
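As an aside, you can control the bin boundaries yourself with the breaks argument. The sketch below carves the observed date range into roughly a dozen equal bins; adjust to taste:

> hist(a$date, breaks = seq(min(a$date), max(a$date), length.out = 14))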

In general we see how we have more books from later decades in our data, though there is decent representation across the entire timeline. Depending on your research goals you would either want a sample that approximates titles published (and is thus skewed towards later periods because as time passes there are always more novels) OR you might want to sample evenly across all periods to ensure that your findings aren’t skewed by a particular period. See the section on data selection for addressing these complex issues.

An important caveat here is that the height of your bars will depend on where you place your bins. Bins of different widths can make the distribution look different. In our case we could just plot each year as its own bin and see what happens.

> plot(table(a$date))

This lets us see that we have no more than 4 novels from the same year, very few extended gaps and once again pretty good balance across the whole period.

Now try it yourself with the novels’ word counts.

You should get a mean of 123,240, a maximum value of 356,109 and minimum of 23,275.

Try making a histogram, too. What do you find?

SAMPLE QUESTION: Do women tend to write shorter novels than men in our sample?

How would you answer this? See the section on hypothesis testing to guide you through a solution.
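As a starting point (a sketch only, not the full treatment you’ll find in that section), you might compare the two distributions visually and then test the difference in means:

> boxplot(a$length ~ a$gender, ylab="word count", xlab="gender")

> t.test(a$length ~ a$gender)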

Manipulating Numbers in Tables

The final thing I want to look at is performing math on columns and matrices. For example, what happens if you realized that every single book had in fact exactly 500 words of boilerplate in its front matter, so that your word counts were over-estimated? If you wanted to remove 500 from the length values you would run:

> a$adjusted.length<-a$length-500

Notice how I created a new column in case I made a mistake and preserved it as a new variable. I might also want to compare the old and the new. You should see that the new column has the same values minus 500 for each word count. R just performs the -500 on all values in the vector, which is very convenient. A few other easy commands:

This sums all the columns of a matrix. It is not appropriate for a whole data frame like ours, because some of the columns aren’t numbers, but you can run it on just the numeric columns (not very meaningful here, but it shows the mechanics):

> colSums(a[,c("date","length")])

The same thing for rows:

> rowSums(a[,c("date","length")])

Let’s say you wanted to study decades not years.

Let’s go ahead and transform years to decades by removing the final number and adding a 0. First we’ll convert the date column to a column of strings to utilize the substring function

> a$decade<-as.character(a$date)

Then we transform the 4th digit to a 0

> substring(a$decade, 4, 4) <- "0"

And convert back to integers

> a$decade<-as.numeric(a$decade)

Now we can see decade-level counts of novels

> plot(table(a$decade))

SAMPLE QUESTION: what is the avg. length of a novel per decade?

In order to run a function (in this case take the mean) over different aspects of our data, we will learn one of the “apply” functions. These are very useful — and very confusing in R. Here we will learn “tapply” which works on “factors.” We will treat our decades as factors in order to see what the avg. word count is by decade.

tapply takes as input the column you want to measure, the factor you want to subset by, and the function you want to run. In this case, we want to measure the mean of the “length” column relative to the different decades (our “factor” where each level is 10 years).

> dec.length<-tapply(a$length, as.factor(a$decade), mean)

This gives us a vector of means, which we can then plot. Because we are moving between factors and numbers the plotting is a little more involved.

> plot(dec.length, xaxt="n", ylab="avg. word count", xlab="decade")

> axis(1, at=1:length(dec.length), labels=names(dec.length))

Notice how there is a period between the 1820s and 1860s where novels appear longer. This is particularly concentrated in the 40s, 50s and 60s.

Another way to visualize this is through the use of boxplots. These allow you to see the range of values for each decade which will give you a better sense of those periods that are particularly different. Read up on boxplots as they are a useful way of visualizing and comparing your data.

The boxplot function takes the following logic: we want to know the distribution of novel length (word count) as a function of the decade of publication. So your dependent variable (here length) goes first, and then your independent variable (decade) goes second.

> boxplot(a$length ~ a$decade, ylab="word count", xlab="decade")
