Is there a text in our data? This was the elementary question first brilliantly posed by Michael Gavin.[1] Another way of asking it would be: when we treat texts as data, what happens to them? How much of the text survives and how much is lost? But also, how are they transformed into new kinds of objects, and how do these new objects relate back to their original form (or do they at all)? In other words, can computational models be appropriate approximations for inferring the meaning of texts for human readers?
These are obviously extremely challenging questions to answer. But in many ways they are the most fundamental ones that we have to answer before we proceed. Over the course of this section you will be learning numerous techniques to model texts. But underneath all of these models lies an assumption that modeling texts as data is a reasonable thing to do. As Gavin reminds us, how could Google return the exact best document to your brief queries without an adequate theory of the relationship between the computational modeling of texts and your beliefs about their meaning?
The theory that will underpin everything to come rests on what is known as the “distributional hypothesis.” This hypothesis, which I’ll explain in a moment, depends on the belief that there is a fundamental relationship between quantity and meaning when it comes to (human) communication. Without that, none of this would make sense. The distributional hypothesis says that symbols (or documents) that share similar probability distributions of related features will share similar meaning. What does this mysterious statement mean? It means that the meaning of words (and documents) is determined by the frequency of the words (and documents) surrounding them. In other words, a word’s meaning (let’s start there) is shaped by the words that appear “near” it. This is the word’s “context.” And the frequency with which you see these words provides a model of that word’s meaning.
The easiest way to think about this is to consider a word that can have two (or more) different meanings. When you see the word “file” and it means a folder containing some documents, you are likely to encounter certain surrounding words that help you ascertain that this is not a nail file. The word “likely” there is really important. You don’t always see the same words every time, but certain words appear more often. This observed frequency can be translated into a probability: when I see the word “file [folder],” I expect to see word A with some probability, or so many times more often than word B. This is how children acquire language. You hear a strange sound xbvgrfh which you’ve never heard before, and then you hear some other sounds which are more familiar to you, and gradually over time you build a model of what to “expect” when you hear xbvgrfh. This becomes xbvgrfh’s meaning.
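To make this concrete, here is a minimal sketch of context counting in Python. The example sentences are invented, and the window of two words on either side is an arbitrary assumption for illustration; real corpora and window sizes would be far larger:

```python
from collections import Counter

# Toy corpus (hypothetical sentences): two senses of "file".
sentences = [
    "open the file and read the report".split(),
    "save the file in the folder".split(),
    "use a file to smooth the nail".split(),
    "the metal file shaped the edge".split(),
]

def context_counts(sentences, target, window=2):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(w for j, w in enumerate(sent[lo:hi], start=lo) if j != i)
    return counts

counts = context_counts(sentences, "file")
# Turn observed frequencies into a probability distribution:
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}
```

The resulting `probs` dictionary is exactly the kind of object the hypothesis describes: a distribution over context words that stands in for the meaning of “file” in this tiny corpus.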
All of this holds for documents as well. We model the meaning of a document as the probability distribution of words that appear within it. But we do more, too. We model it as a distribution over all words in all the documents we are studying, i.e. as the words it could use but does not. The document’s “context” here is other documents. To take an example, when I read a work of fiction I am building a model in my head of what to expect when I read fiction based on the “features” I encounter as I read. The more I read, the more precise my model becomes and the more sophisticated my features. There are of course always surprises, so I update my model accordingly. A work is classified as fiction by my mind when it uses certain features at a certain rate with respect to other types of books I have read (in this case “non-fiction”). Since it’s a probabilistic model it can be “wrong,” but it can also be “updated” or improved. Cultural behavior doesn’t follow natural laws (with one big exception, see below). But that’s what makes it so fun to study. Humans are weird.
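As a sketch of this idea, the snippet below models three invented mini-documents as bags of word counts and compares them with cosine similarity over the union vocabulary, so that words a document could use but does not contribute zeros. The documents and the choice of cosine similarity are illustrative assumptions, not a claim about any one method:

```python
import math
from collections import Counter

# Hypothetical mini-documents.
doc_a = "the dragon flew over the castle and the knight drew his sword"
doc_b = "the knight rode to the castle and fought the dragon"
doc_c = "the committee reviewed the quarterly budget report"

def bag_of_words(text):
    """Model a document as a distribution of word counts."""
    return Counter(text.split())

def cosine(c1, c2):
    """Cosine similarity over the union vocabulary; absent words count as 0."""
    vocab = set(c1) | set(c2)
    dot = sum(c1[w] * c2[w] for w in vocab)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

a, b, c = map(bag_of_words, (doc_a, doc_b, doc_c))
sim_ab = cosine(a, b)  # the two fantasy passages share much of their vocabulary
sim_ac = cosine(a, c)  # the budget memo overlaps only on "the"
```

Here `sim_ab` comes out higher than `sim_ac`: the model “classifies” the two fantasy passages as more alike because their word distributions overlap more, which is the document-level version of the distributional hypothesis.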
The distributional hypothesis is thus predicated on two fundamental assumptions: first, a cognitive assumption about the probabilistic way we assess meaning and, second, a rhetorical assumption about the spatio-temporal way we assess meaning in written and spoken discourse (i.e. proximity). The first assumption says that we use these contextual probabilities to assess meaning when it comes to communication (whether of words or images or sounds). The more I see some set of features, the more I assign a specific meaning to something, and the more two things share similar distributions of these features, the more similar I will assume them to be. The second assumption says that proximity is a useful heuristic for context. My attention wavers, and so my model of what is relevant to understanding a word is limited; it turns off after a limited period of time. Closer is more valuable than further, but what this window is and how linear the relationship is remains a very open question. Nevertheless, the distributional hypothesis has been remarkably successful in replicating human judgments surrounding the meaning of texts (Bod) as well as predicting processes of human language acquisition in childhood (see the subfield of “statistical learning” (Thiessen)). Recent work in the field of information theory has also capitalized on probabilistic models of meaning and communication (Crocker).
So when critics say, “you’re really just counting words,” the answer is: yes, exactly! And so are you! The difference is, I am doing it explicitly and you are doing it implicitly. The problem is, we don’t know exactly how and to what extent quantity informs meaning when it comes to communication and cognition, which means that when we model texts as data there is a great deal of uncertainty as to the exact relationship between our models and their meaning. So before we get all high and mighty about our models, we need to admit that there is a lot we do not yet know. But what a great area for future study! The beauty of not knowing something is that if it’s important then it’s worth studying. Why wouldn’t you want to study how communication works?
[1] Michael Gavin, “Is There a Text in My Data? (Part 1): On Counting Words,” Journal of Cultural Analytics, September 17, 2019. https://culturalanalytics.org/2019/09/is-there-a-text-in-my-data-part-1-on-counting-words/.