

You could but why would you?

A critic once quipped that using numbers to study literature, culture, and history was like throwing a fish at a painting. You could, but why would you?

It turns out the answer to that question is eminently straightforward. Imagine you want to say something about “the Victorian novel” or “contemporary television” or “the history of sermons in New England.” The old way to do this was to read a lot of documents or watch a lot of TV and then say what you thought they “all” had in common. Or you just chose one and decided it was really representative of those big categories.

The problem is easy to spot here. Are the documents you read representative of the category you invoked? How would you know? How much variability is there within them? Do they all do exactly the same thing or behave in slightly different ways? Is there any way to validate that what you are seeing isn’t just a figment of your imagination or, to be less charitable, a result of your personal bias? Are you sure that five examples is enough?

You might think that such general statements aren’t all that important to the study of human culture (i.e. the humanities). What really matters are individual texts, lives, images, or some really neat details about the past. I’m sure you have heard this before. You might even believe it yourself. And it may be true in theory, but it is not borne out in practice by the people who say it. Here are some examples of the types of statements you are likely to find in humanities research today:

Over the past few decades, humanists have insisted that it is important to resist generalizations.

The process of secularization associated with modernity has not spread as widely nor penetrated as deeply as Western humanists have tended to assume, not even in the academy.

Without its world, the human is merely another species on earth, testing itself against threats of its own creation and in the process becoming a force like nature (capable only of overt behavior) that jeopardizes its own existence.

Western European philology developed in the eighteenth century at much the same time that the notion of literature did. 

The funny thing that happened to charm on its way to modernity was the disenchantment of the West.

Today, anything, it would seem, can be art.

In fact, in my lab we have taken the trouble to quantify just how often these kinds of statements are made in fields like history and literary studies. It’s not easy to do, but a rough estimate suggests that approximately 30-40% of statements made in the introductions to research articles in these fields are generalizations (itself a generalization). So it appears that drawing general conclusions about literature or history or art or television or film is fundamental to the humanities, despite protests to the contrary. Indeed, data suggests that talk about large-scale social phenomena has actually been growing for the past few decades. Despite what people say about reading individual documents, what they really want to do is talk about the social importance of those documents.

And why shouldn’t they? The functions of literature or TV or sermons or parliamentary debates or social media or newspapers in society are great questions, questions for which we don’t yet have great answers. But one thing we do know is that you can’t reliably and convincingly come up with answers to these questions by picking a few of your favorite novels or shows or documents from the past and talking about them. This book is about teaching you some of the techniques to begin to be able to draw (tentative!) conclusions about how literature, screenplays, lyrics, or any set of documents work in the world. It’s about thinking small (through modeling) to think big (how culture and history work).

Let’s say for a moment that you didn’t want to make such general assessments. You just wanted to focus on a single work of art or philosophy and tell your readers why you think it is so great. Scholars do this all the time, too. In this case, numbers might seem like a poor fit for your task. But here too you would be wrong. What makes Goethe’s Wilhelm Meister novels or Rousseau’s Confessions special (or unique, or great, or just plain thought-provoking) has everything to do with the context that you use to argue for their specialness. They are significant with respect to some comparison (novels or memoirs that came before, other novels or memoirs produced at the same time, all novels or memoirs ever). How can you be certain that the significance you’ve identified isn’t also a product of your imagination, or once again, bias? 

Wait, you might be asking yourself, did he just equate “imagination” with “bias”? I thought the whole point of the humanities was to facilitate our creative thinking!? Fair point. At their best, the humanities are a massive engine of intellectual innovation, as we take the repository of written works created throughout human history and spin out new ideas from them. The Achilles heel of this enterprise (a concept that of course derives from the humanities themselves) shows when we want to claim that something is a true and accurate depiction of the past (or present). The problem is when we turn our creative intuitions into authoritative norms: “This is what caused the First World War,” or “The point of the novel is to [fill in the blank].” Traditional methods of close reading individual documents do not have a way of externally validating these subjective impressions. This isn’t a problem if you want to exercise your subjectivity. It is a problem if you want to contribute to scholarship.

In order to build up an understanding of how different kinds of writing have functioned in the past and continue to do so in the present, quantitative methods are not only not alien to this exercise, but fundamentally necessary. To put it in more polemical terms, there is something problematic when you rely only on personal judgments to make normative statements about how the world works. Why? Because when you do so, the ultimate grounding you have for these judgments is charisma: your individual power, status, and rhetorical persuasiveness. You could, but why would you?

This book starts with the assumption that quantitative methods have a role to play in the study of literature, culture, and history. Notice I did not say an exclusive role to play. There can be no enumerative reading without close, detailed attention to how texts or artifacts work at an individual and fine-grained level. One of the differences between proponents of quantitative methods and their critics is that critics argue that these methods have no place at all in the study of the humanities. Proponents suggest instead that they have a complementary role to play. Domain knowledge, and all of the traditional training that goes along with this knowledge, is still essential for large-scale analysis. You cannot scale up if you have no idea what is going on at the local level. Your creativity and subjectivity are still extremely important for thinking about how fields like literature, art, and history work, no matter what the scale. Those close encounters with documents or images will help you develop hypotheses, construct theories, and validate your models. In other words, they help you with almost all of the steps of the research process. If you want to be a good data-driven humanist, you will need to read, look at, or watch a lot of things that belong to your field. There is no substitute for experience and learning.

Nevertheless this book will argue that quantitative methods have something important to offer to the study of human cultures, which I will group into three larger categories: 

Representativeness. Data-driven methods afford the ability to reflect on representativeness —  when and under what conditions is the evidence we are considering representative of something? Traditional methods have a very poor track record on this, often eliding these kinds of considerations altogether — why did I consider this document? what other documents did I consult but not report on? how did I come to access these documents? what are they representative of and how might I confirm this assertion? why did I look at so few and yet claim so much? As I will discuss at length in the section on data selection, having more data does not magically mean we have representative data. But it does mean we have the means to have a discussion about representativeness, which is currently not the case.

Validation. Data-driven methods also afford the ability to reflect on the validity of what has been argued. Validity is of course a very complex topic, especially with respect to questions of cultural meaning (as Clifford Geertz famously asked, what is the meaning of meaning?). What would be a valid interpretation of Pride and Prejudice? The point is not that there will be a final answer as to the meaning of Pride and Prejudice or the British novel, but that data-driven methods afford researchers the ability to provide a justification for their interpretation that is also external to the researcher’s personal beliefs. This interpretation is reasonable under these and these circumstances. Or: this interpretation holds with this and this much uncertainty. The process of validation provides mechanisms for externalizing truth claims beyond just our individual authority or belief systems. Notice how I say “just” — because these things are all still in play in data-driven research. Research is still people, as the saying goes. But the aim of data-driven research is to create an evidentiary paradigm that relies on as much shared understanding of a problem as possible. The aim of shared understanding is founded on three core values: visibility (i.e. transparency), externality and reproducibility. As Brian Nosek has argued, “Other types of belief depend on the authority and motivations of the source; beliefs in science do not.”[1]
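To make “this interpretation holds with this and this much uncertainty” a little more concrete, here is a minimal sketch in Python of one common way to attach uncertainty to a measurement: a percentile bootstrap interval around a mean. The scores are invented for illustration, and the function name `bootstrap_interval` and its parameters are assumptions of this sketch, not a method the book prescribes.

```python
import random
import statistics

def bootstrap_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    Resamples the data with replacement, records the mean of each
    resample, and returns the lower and upper percentile cut-offs.
    A fixed seed makes the result repeatable by someone else.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

# Hypothetical per-document scores (invented numbers, not real data).
scores = [12.0, 15.0, 9.0, 22.0, 18.0, 11.0, 14.0, 20.0]

low, high = bootstrap_interval(scores)
print(f"mean = {statistics.fmean(scores):.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

With only eight documents the interval will be wide, which is the point: the claim is reported together with how much it could move if different documents had been consulted.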

Reproducibility. This brings me to the final affordance of data-driven research. The primary aim of using data in this book will be to arrive at insights about how the world works. This means that those insights ultimately ought to be independent from the observer who makes them. And this means that someone else, at a different time and place, ought to be able to reproduce those same insights. Of course, the idea of “reproduce” is a complex one, too, and I will spend some time discussing its uses and abuses. But at its core it says that something can be believed to be true when two or more independent individuals are able to arrive at the same conclusion. This may involve using the same data and methods or it may involve using the same methods on new data (or vice versa). We also know that “independent” here is a useful fiction: two researchers are never wholly independent from one another because they belong to an interlocking web of institutions and belief systems.

 In other words, all three of these values have very strong caveats attached to them. But the argument of this book is that even in their conditional form they are better than current research practices for the purpose of making generalizable claims about how human culture works. So if that is what you want to do, then read on.  

What follows is a guide on how to construct quantitative models for the study of human culture. It is designed for people with a background in the humanities who might have little or no knowledge of quantitative modeling. It starts at the very beginning, both in terms of programming skills and in terms of conceptualizing research questions. It should therefore be a useful handbook for everyone from advanced undergraduates to faculty interested in incorporating large-scale analysis into their research.

As you will see, the book focuses principally on the study of documents, but hopefully you will be able to see how many of the same principles can be applied to the study of sound or images. There are lots of tools, handbooks, and guides out there. Most of these focus on communicating to you a set of programming commands. That will be important here too. You can’t analyze texts as data if you don’t know how to handle texts as data. In the second section, this book teaches you how to do large-scale data analysis using R (and eventually Python), the two lingua francas of data science today. But my principal focus will be on the conceptual problems that attend the process of data-driven humanistic research. You can have all the tools in the world at your disposal, but if you don’t know why you are using them then you won’t be a very successful researcher.

The first section is about how to develop your thinking about modeling the world. It discusses how to develop key concepts; how to select data and watch out for a variety of kinds of selection bias; how to develop measurements of your concepts and watch out for the slippages that occur between these steps; and finally how to validate your measurements using a variety of statistical tools. This first section does not go into great detail about each step, but provides a conceptual blueprint for thinking through data-driven research questions in the humanities. 

The second section provides an in-depth look at how to analyze text data. It takes you from reading in files, to transforming them into analytical objects like term frequency matrices, to more advanced tools like topic modeling and machine learning. In particular, I will introduce you to the idea of feature construction, which is the process of building features that will in turn be used to represent your documents (there is that term representation again). A model is always an approximation: the process of feature construction allows you to reflect, critically and explicitly, on your modeling choices, on the way you are translating a complex idea (like “textual difficulty”) into a series of potential measurements. Overall, the section focuses on bringing you through three possible research frameworks, which are called “exploratory,” “explanatory,” and “predictive” modeling. The approaches and their appropriateness for different research aims will be discussed so that you will have a good idea of which approach might be most suitable for your own research questions.
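To give a first feel for what a term frequency matrix and a constructed feature look like, here is a minimal sketch in Python using only the standard library. The two toy documents are invented, the tokenizing rule is deliberately crude, and average sentence length is just one assumed stand-in for “textual difficulty”; none of these choices should be read as the book's own method.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequency_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term vocab[j] in docs[i]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocab] for count in counts]
    return vocab, matrix

def mean_sentence_length(text):
    """One hypothetical feature for 'textual difficulty': average words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return len(tokenize(text)) / len(sentences)

# Toy documents, invented for illustration only.
docs = [
    "The whale swam. The whale dived.",
    "The painting hung in the hall.",
]

vocab, matrix = term_frequency_matrix(docs)
print(vocab)   # ['dived', 'hall', 'hung', 'in', 'painting', 'swam', 'the', 'whale']
for row in matrix:
    print(row)  # [1, 0, 0, 0, 0, 1, 2, 2] then [0, 1, 1, 1, 1, 0, 2, 0]
print([mean_sentence_length(doc) for doc in docs])  # [3.0, 6.0]
```

Each row is a document and each column a word, so every document becomes a list of numbers that can be compared with every other. Whether sentence length really captures difficulty is exactly the kind of modeling choice that feature construction asks you to make explicit and defend.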

The final section offers examples of models in practice on real-world questions that matter to researchers in the humanities. Having taught this for close to a decade now, one of the things I have learned is that the biggest barrier isn’t actually technical. Programming is annoying and hard if you have never done it before. For sure. But it can also be fun, like tinkering around with making a sculpture, a song, or a work of art. The real barrier is opening your mind to thinking about cultural problems quantitatively and not qualitatively, getting beyond the “I think Yeats is great” approach. You can learn linear regression or machine learning in literally hundreds of textbooks today (or online for free). What is not at all clear is how to use those tools to do research in the humanities. That is what this book is about.

Let’s go throw some fish.


[1] Open Science Collaboration, “An Open, Large-Scale, Collaborative Effort to Estimate the Reproducibility of Psychological Science,” 657.

One thought on “Preface”

  1. If I get it, in data based research, we are still throwing a fish at the painting trying to understand how the world works by attempting to use the tools themselves . Yet this can enable us to make new arguments and statements about humanities and create a new «shared» understanding.

