Once you have your question, your concepts, and your data, you then need a quantitative proxy for your ideas (or theoretical constructs). As one of my colleagues likes to remind me, translating human experience onto the real number line is a strange thing to do. There simply has to be a great deal lost in doing so. As I hope this book will show, however, there is also much to be gained. Doing so allows you to test your ideas on considerably more pieces of evidence than you ever could by hand; it allows you to make visible and explicit your methods and your working procedures; and it allows you to try to make your conclusions about how the world works independent of your own personal beliefs. I say “try to” because of course those beliefs always enter into the process at numerous stages (like every one we’ve seen so far). Remember, you’re not operating in the realm of absolute truth, but one of “confidence.” You are trying to gain as much confidence as possible that something is true, knowing you can never definitively know if it is true or not. Disagreement is ok. It is how we all gradually move closer to a piece of knowledge that we hold together.
Measurements are the tools you will use to arrive at this consensus, along with the process of validation and discussion. You create a measure, then validate its effectiveness, and then discuss what you have done to interpret your results. The measurement phase is thus one of the most creative steps in the process. How might you translate your concept onto the real number line? This is yet another act of representation and specification, as something large and vague is represented by something more precise. Both the large, vague concept and its more precise stand-in are important.
To return to our running example, Lancashire and Hirst represent “age-related cognitive decline” as “declining vocabulary richness” in their conceptual phase. They then represent “vocabulary richness” according to three specific measures that turn this idea into quantities: type-token ratio, repeating phrase frequency, and indefinite word frequency. While we will spend considerably more time thinking about measuring features in the next section, let me just explain briefly here what each of these terms means so that you can see how each measure is itself a representation of the underlying concept (and, in turtles-all-the-way-down fashion, how each measurement is only one way to represent the measurement itself. Huh? Don’t worry, this will make sense).
Type-token ratio calculates the relationship between word types and word tokens. A word “type” is a class of words (for example, “the” is a word type), while a word token is an instance of a given type, i.e. every time “the” appears in a text it is another token of the type “the.” The ratio is the number of types divided by the number of tokens, and it tells you how much repetition there is in a document. The higher the ratio (the closer to 1), the less repetition there is: each type is used only a few times. The lower the ratio (the closer to 0), the more repetition there is: we use each type over and over again (i.e. our numerator, the types, is very small, while our denominator, the tokens, is very large). Language use naturally has a lot of repetition, which actually increases the longer you write or talk (given a finite number of words, I will inevitably start reusing them more and more the longer I go on, which is why lectures seem increasingly boring over time, though there are other reasons for this, too). This is why Lancashire and Hirst concentrate on texts of a fixed length, because otherwise the longer ones will naturally appear more repetitive.
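To make the arithmetic concrete, here is a minimal sketch of a type-token ratio in Python. It is my own illustration, not Lancashire and Hirst's procedure: the tokenizing rule, the function names, and the choice to take the first `window` tokens as the "fixed length" are all assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(tokens)) / len(tokens)

def fixed_length_ttr(tokens, window):
    """Ratio for the first `window` tokens only, so that texts of
    different lengths can be compared (assumed sampling rule)."""
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    return type_token_ratio(tokens[:window])

sample = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)

# 13 tokens, 8 types ("the", "sat" and "on" repeat)
print(len(tokens), len(set(tokens)), round(type_token_ratio(tokens), 3))
```

Notice how many small decisions are already hiding in this: whether to lowercase, what counts as a word, where to cut the text. Each one is a choice about how to represent the measure.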
The next two measures they use are repeating phrase frequency and indefinite word frequency. They define repeating phrases as 2- and 3-word units (which we will call “n-grams” in the next section) that appear more than once. Indefinite words are defined by a list of words whose reference is vague rather than specific, such as someone, somehow, something, etc. I hope you can see right away how the measurement itself has layers too: indefinite word frequency is a type of “vocabulary richness,” but this list of words is itself a type of “indefinite words.” You might define “repeating phrases” differently, too. In other words, we are always using representations (see above).
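These two measures can be sketched in a few lines of Python as well. Again, this is an illustration under assumptions: the word list below contains only the three examples named above (the authors' full list is not reproduced here), and the function names and the way results are reported are my own.

```python
from collections import Counter

# Hypothetical, incomplete list for illustration; not the authors' actual list.
INDEFINITE_WORDS = {"someone", "somehow", "something"}

def ngrams(tokens, n):
    """All consecutive n-word units in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repeating_phrases(tokens, sizes=(2, 3)):
    """The 2- and 3-word units that appear more than once, with their counts."""
    counts = Counter()
    for n in sizes:
        counts.update(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c > 1}

def indefinite_word_frequency(tokens, word_list=INDEFINITE_WORDS):
    """Share of tokens that belong to the indefinite word list."""
    if not tokens:
        raise ValueError("cannot compute a frequency for an empty text")
    return sum(1 for t in tokens if t in word_list) / len(tokens)

tokens = "someone said something and someone said nothing".split()

print(repeating_phrases(tokens))          # only ("someone", "said") repeats
print(indefinite_word_frequency(tokens))  # 3 of the 7 tokens are on the list
```

Change the word list or the phrase sizes and you get a different number for the "same" concept, which is exactly the layering described above.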
The final measurement the authors use is “age.” In other words, they are looking to test whether there is a relationship between Christie’s age, which will be called the independent variable, and their measures of vocabulary richness, which will be called the dependent variables. They want to know whether this thing called vocabulary richness, which they’ve measured in three separate ways, is dependent on Christie’s age (which is not to say that the one causes the other, just that there is an association between the two).
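One common way to put a number on an association between an independent and a dependent variable is a correlation coefficient. The sketch below computes Pearson's r by hand. This is a technique I am swapping in for illustration: nothing above says which statistic the authors used. The numbers in the example are abstract placeholders, not Christie's ages or vocabulary scores.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.
    Ranges from -1 (perfect negative association) to +1 (perfect positive)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return cov / sqrt(var_x * var_y)

# Placeholder values: x plays the role of the independent variable,
# y the role of a dependent variable that falls as x rises.
x = [1, 2, 3, 4]
y = [8, 6, 4, 2]
print(pearson_r(x, y))
```

A value near -1 or +1 tells you the two quantities move together. It does not tell you that one causes the other.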