Validating your findings and your measurements

So is there? That, of course, is the big question. The first thing to remember is that you cannot definitively answer it. Why? Because of all of the limitations we have encountered so far! Christie does not represent all writers. All writers do not represent all humans. The selected set of Christie’s novels does not represent all of her novels. The first 50,000 words do not completely represent the novels that were selected. Type-token ratio is not exactly equivalent to vocabulary richness, and neither are repeating phrases or the prevalence of semantically vacant words.

Nevertheless, you can gain confidence that there is a relationship between Agatha Christie’s aging process and the behavior of her vocabulary as exhibited by her novels. How do you do so? The first thing you need to do is validate your measurements. In other words, how can you be confident that type-token ratio is a good approximation of vocabulary richness? And how can you understand what a change in type-token ratio of a given severity means? What does a drop of 1% mean? There are, then, two types of validation, which I will call “instrument validation” and “model validation.”

If you are using an instrument that has already been implemented before, then you are in luck. It is likely that other researchers have already attempted to estimate the validity of the instrument, and you can draw on that existing literature. To take a different example, let’s say you wanted to measure something akin to a text’s “difficulty.” While this is a complex and vague concept, there has been considerable research in this area, most notably beginning with the work of Rudolf Flesch. Flesch was a Viennese immigrant who fled Nazi-occupied Austria and came to the U.S. in 1938. He ended up as a student in Lyman Bryson’s Readability Lab at Columbia University. The study of “readability” emerged as a full-fledged science in the 1930s, when the U.S. government began to invest more heavily in adult education during the Great Depression. Flesch’s insight, which was based on numerous surveys and studies of adult readers, was simple. While there are many factors behind what makes a book or story comprehensible (i.e. “readable” or, conversely, “difficult”), the two most powerful predictors are a combination of sentence length and word length. The longer a book’s sentences and the more long words it uses, the more difficult readers will likely find it. Flesch then developed a formula that best predicted readers’ judgments of a text’s expected difficulty (and also estimated what grade level the text might be appropriate for). Since then, over 30 different measures have been proposed to capture this concept, all validated in different contexts.
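To make that concrete, here is a minimal sketch of Flesch’s best-known formula, the Flesch Reading Ease score, which combines exactly those two predictors (words per sentence and syllables per word). The syllable counter below is a crude vowel-group heuristic of my own, not part of Flesch’s procedure, and the sample sentence is invented for illustration.

```python
import re

def count_syllables(word):
    # Crude heuristic: count runs of vowels (an assumption, not Flesch's method)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease: higher scores indicate easier texts
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("The cat sat on the warm mat. It purred and slept."), 1))
```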

In other words, what Flesch did was first translate his theoretical construct of “readability” into a measurement and then assess that measurement relative to the judgments of human readers. He validated his instrument. This will be one of the basic moves in data-driven research that we will come back to time and again. The only way you can know if a measure is meaningful to humans is to ask humans. This will form one of the foundations of machine learning, where machines learn based on examples provided by human judgments. If those judgments are flawed or biased in some way (humans, biased? what?!), then the machine will be, too. Humans in, humans out, as they say. Or as Yoshua Bengio remarked, “I’m not worried about the machines, I’m worried about the humans!” Me too.

Returning to Lancashire and Hirst, there has been considerable research validating the use of type-token ratio to assess the linguistic behavior of different groups. Teenagers, for example, repeat themselves a lot more when talking to each other than adults do. Surprise! The point is not that this is surprising, but quite the opposite — the measure captures precisely what we expect. Therefore, it is a good measure.
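Since type-token ratio is the instrument in question, here is a minimal sketch of the measure itself (a fuller walkthrough comes in the next section). It assumes a plain-text file and a naive tokenizer; the filename is a placeholder, not a real corpus file.

```python
import re

def type_token_ratio(text, n_words=50000):
    # Lowercase and split into rough word tokens, keeping only the first 50,000
    tokens = re.findall(r"[a-z']+", text.lower())[:n_words]
    types = set(tokens)  # the distinct words (the "types")
    return len(types) / len(tokens)

# Hypothetical usage: "christie_novel.txt" stands in for one novel's text
with open("christie_novel.txt", encoding="utf-8") as f:
    print(round(type_token_ratio(f.read()), 4))
```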

The second level of validation concerns validating the results of your model. This is a bit trickier (ok, a lot trickier). What it means is that we are going to have to wade into the murky and tremulous waters known as the “Bay of Statistical Inference” (not to be confused with Bayesian statistical inference, which is only funny if you have already moved in). Once you go in, you never come out. Actually, it’s super interesting and worth a pitstop. But of course you may never come out. At least you won’t come out the same as you went in. Ok, let’s enter.

The first thing you need to know is that humans are terrible at statistical inference! Isn’t that just perfect? We rely on this mode of reasoning that is quite literally counter-intuitive. Who thought of this? Well, that story is equally checkered. A lot of early statistics were developed and deployed to prove the racial superiority of certain groups over others. Nice! So we have a method that was initially used to amplify our intrinsic racist tendencies that has the added benefit of running counter to our intuitive mode of reasoning. What could possibly go wrong?

The next thing you need to know is that there can be different goals associated with your research, and depending on these goals you are going to think about “validation” in different ways. You can think of these goals as relating to three possible types of research, which I’ll call “exploratory,” “explanatory,” and “predictive” modeling.

Exploratory approaches seek to discover novel and unforeseen patterns latent in texts (or between texts). Rather than start with a priori beliefs about an object of study, exploratory approaches want to discover something that hasn’t been seen before. Exploratory approaches can be very good at developing hypotheses or theories about a body of texts as well as avoiding what is known as “confirmation bias,” where you structure your inquiry in such a way as to find exactly what you were looking for. (There is a fascinating body of reading on this topic that goes by the name of “p-hacking.”) Exploratory approaches are great at discovering new relationships between things, but are poor at assessing the validity or meaningfulness of those relationships.

Explanatory approaches attempt to explain the underlying principle behind patterns found in data (i.e. estimating the true parameters of the process that produced the data). At their simplest, they may try to explain the strength of association between two variables (e.g. are highbrow novels more strongly associated with nostalgic narrative structures than lowbrow novels?). At their most complex, they may try to explain causal relationships between variables (does smoking cause lung cancer?). Causal relationships are in general extremely difficult to ascertain and prove (talk about a lot of reading material). Given that most of the text data researchers work with in the humanities is what is known as “observational” — it has been collected out in the wild and cannot be experimentally manipulated — moving to causal inference is outside the scope of most projects. Explanatory approaches are valuable because they can give us confidence that what we are seeing isn’t just some random effect, but the product of a cultural process. That nostalgia is so much more strongly associated with “high prestige” novels than bestselling ones suggests that there are selection mechanisms at work that are helping to drive this outcome. While further research would be needed to know whether this association is more likely due to production factors like editors, authors, and agents or reception factors like reader taste or prize committee dynamics, it can begin to give us confidence that this association is something that is actually taking place in the world. There is some kind of “effect” there.

Finally, the third approach is known as “predictive modeling” and is most strongly associated with the technique of machine learning. Here, the goal is to predict future behavior rather than explain the underlying mechanisms for that behavior. As Yarkoni and Westfall have argued, these efforts may in fact be at cross-purposes — complex explanations for observed data may be very poor at predicting future, as yet unseen behavior. Why? Because the initial explanations were over-fit to the observed data and don’t generalize well to future actions. In predictive modeling, the researcher’s aim is not to understand the variables or features that result in a successful prediction (indeed, one strong assumption is that such qualities are unknowable (Yarkoni & Westfall, 2017)), but to assess the predictability of the phenomenon itself. We will see examples of this approach in action and how it can be used to ask valuable research questions. But we will also see its limitations at work, because very often the “features” that underpin cultural behavior are what we want to study. How predictable fiction is as a form of communication is a very interesting question. But so is the question: what are the qualities that are distinctive of fictional storytelling, i.e. how did you make those predictions? What matters?
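To make the logic of prediction concrete, here is a minimal sketch of the basic move: judge a model by how well it predicts data it has never seen, rather than by how well it fits the data it was trained on. The features and labels below are synthetic placeholders, not textual data from any actual study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                # stand-in "text features"
y = (X[:, 0] + rng.normal(size=200)) > 0      # stand-in binary outcome

# Hold out unseen data: accuracy on the test set estimates generalization,
# while accuracy on the training set can be inflated by over-fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", round(model.score(X_train, y_train), 3))
print("test accuracy: ", round(model.score(X_test, y_test), 3))
```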

Let’s return to Lancashire and Hirst again. They want to know if there is an association between vocabulary richness and aging in Christie’s novelistic output. In other words, they are engaging in explanatory modeling, because they would like to confirm whether there is a meaningful relationship between these variables (“meaningful” here will be used alongside “significant,” which is the more statistical term). They’ve collected the novels, they’ve measured type-token ratio (you’ll learn how to do this in the next section), and now they want to know if there is a relationship between these two things. How do they do so? They use a procedure known as “linear regression.” It’s the bread and butter of the quantitative social sciences, for better and for worse (depending on who you ask).
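As a minimal sketch, in code it might look something like this (the ages and type-token ratios below are invented purely for illustration; they are not Lancashire and Hirst’s actual measurements):

```python
from scipy import stats

# Invented data: author ages and type-token ratios, for illustration only
ages = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
ttr  = [0.145, 0.143, 0.144, 0.141, 0.140, 0.138, 0.137, 0.133, 0.131, 0.126, 0.121]

# Fit a straight line: ttr = intercept + slope * age
result = stats.linregress(ages, ttr)
print(f"slope:   {result.slope:.5f}  (change in type-token ratio per year of age)")
print(f"p-value: {result.pvalue:.4f}")
```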

How does it work? I’m not going to go into the details here (actually, not even later, as this is all stuff that exists in basic statistical handbooks). But the point is that tests like linear regression (and ANOVA, and t-tests, and chi-squared tests, and all sorts of statistical tests) help us validate our findings in some way. Using statistical tests like this is called “hypothesis testing.” You start with a “null hypothesis,” in this case that there is no relationship between aging and type-token ratio in Christie’s work. This is a perfectly reasonable thing to assume. Maybe there is no relationship between these things. Then you estimate how improbable your data would be if that statement were true. According to the data, if there were truly no relationship, we would expect to see an association this strong less than 5% of the time. Another way to think about it: if we took 100 writers for whom there was truly no relationship between age and vocabulary richness, fewer than five of them would show an association as strong as the one we see in Christie’s novels just by chance.

That’s small enough to encourage us to assume that this relationship is “valid,” or at least to give us that much confidence that it is. But is the relationship meaningful? This is an important distinction. Something may be statistically valid but experientially meaningless. This becomes especially true as the size of your data grows. The more observations you have, the easier it becomes to show statistical significance. So the question you always need to ask yourself is, does it matter? In this case, how much does vocabulary richness decline, and what does this decline translate into in terms of words on the page? Two things can be related and the relationship still be essentially meaningless in practice. For that you need to dive into the numbers and see how much her vocabulary richness declines, how much more repetition this produces, and what this sounds like on the page of a novel. You might also want to compare this to other writers to see if this is a normal decline exhibited by “most” writers or if people with extraordinary cognitive decline show stronger tendencies. You might also want to test this a different way, since assuming the relationship is “linear” might not be the best approach. Rather than exhibiting a steady, continual decline, you might expect the decline to be sudden and accelerating.
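One simple way to probe that last possibility, sketched below with the same invented numbers as before, is to compare how well a straight line and a curved (quadratic) trend fit the data; a real analysis would use a proper model-comparison criterion such as AIC or cross-validation rather than raw residuals.

```python
import numpy as np

# The same invented ages and type-token ratios used above
ages = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])
ttr  = np.array([0.145, 0.143, 0.144, 0.141, 0.140, 0.138, 0.137, 0.133, 0.131, 0.126, 0.121])

for degree, label in [(1, "linear"), (2, "quadratic")]:
    coeffs = np.polyfit(ages, ttr, degree)        # fit a polynomial trend
    residuals = ttr - np.polyval(coeffs, ages)    # how far the fit misses
    print(f"{label}: sum of squared residuals = {np.sum(residuals**2):.6f}")
```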

The point is that there are many, many ways to validate the association you are looking at. There is a tremendous amount of literature on the do’s and don’ts of statistical inference. I told you it’s a cove that is hard to get out of. But the main emerging consensus is not to focus too much on what is known as the p-value, which is the number that estimates how probable your data would be if the null hypothesis were true. Instead, the primary aim should be coming up with the best estimate for the thing you want to understand. Whether there is a “significant” relationship between alcohol consumption and lifespan isn’t the primary question. What this relationship is is (nice type-token ratio there). In other words, what I really want to know is how many fewer days I am expected to live because I had a glass of wine today (and tomorrow and tomorrow and tomorrow).

Or, in the case of Lancashire and Hirst, how strong is the association between type-token ratio and aging, and when does it appear to set in? How close is this to when colleagues began observing her decline, and might this be a useful early-warning system? Would it be creepy or helpful to have a piece of software diagnosing your emails to assess whether you were at risk for Alzheimer’s? These are the types of questions that text analysis inserts into the world and that a p-value alone can’t answer.
