Selecting Data

Lancashire and Hirst wanted to know something about the relationship between age-related cognitive decline and the vocabulary richness exhibited in the work of creative writers. To do so, they chose to work on a single author, Agatha Christie. How representative is Agatha Christie of the general population of humans? Not at all. Or better: we have no idea. A single case study cannot tell us anything general about the world. I would say it again, but you get the point. When you read case studies in traditional humanities scholarship that then proceed to generalize from that single case, you ought to be very, very concerned.

However, this does not mean that case studies are not valuable. They are. It is just that their value has limits. So what is the value? Case studies are useful because they can help begin a process of inquiry — by developing theories, establishing core concepts, and testing methods — in other words, all of the stuff that matters to research. But if you use the case study approach, you need to be very explicit about the limitations of what you are doing and reflect on those limitations in your writing.

The selection does not stop with Agatha Christie, however. The authors engage in a further round of selection by not testing every single novel Christie ever wrote. Instead, they examine a collection of 14 novels, out of a total of 85, written over the course of her career. Further, they only examine the first fifty thousand words of each novel to control for differences in length. In other words, their analysis examines a writer selected from the pool of all writers, whose work is approximated by a selection of all her works, which are in turn approximated by a selection of a portion of each work. That’s a lot of approximations.
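
To make that length control concrete, here is a minimal sketch of what truncating each text to its first fifty thousand words might look like in Python. The folder name, the plain-text format, and the naive whitespace tokenizer are my own assumptions for illustration, not Lancashire and Hirst’s actual pipeline.

```python
from pathlib import Path

WORD_LIMIT = 50_000  # cap every novel at its first 50,000 words

def truncate_text(path: Path, limit: int = WORD_LIMIT) -> str:
    """Return the first `limit` whitespace-delimited words of a plain-text file."""
    words = path.read_text(encoding="utf-8").split()
    return " ".join(words[:limit])

# Hypothetical folder of plain-text novels; the directory name is illustrative.
corpus = {p.stem: truncate_text(p) for p in Path("christie_novels").glob("*.txt")}
```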

The point here is not to fault Lancashire and Hirst. In almost every circumstance you will be using a selection of some larger whole. It’s no different when you focus on one novel or a handful of sermons to think about the past. In the social sciences, the selection of documents is referred to as a “sample” and the whole as the “population.” Despite the claims of big data, it is extremely rare to have every single document that belongs to whatever category you are examining. Even if you do (say you had every parliamentary transcript ever), you won’t likely have data on every single aspect that matters to those documents (all of the extra-parliamentary actions that affect those statements and help shape their meaning). If you want a good review of the pros and cons of big data, I urge you to read Matthew Salganik’s Bit by Bit. Much of what I will be saying here is inspired by his work.

The point I want to make is not that selectivity hopelessly ruins a good project (it certainly hasn’t stopped humanists until now!). It’s that you need to make those choices explicit (as Lancashire and Hirst do) and reflect on their appropriateness. This is one of the biggest differences from traditional humanities scholarship. When examples are chosen in traditional scholarship, we rarely know what the larger pool is from which they are drawn, how representative they are of that pool, or what the criteria of selection were (other than “it supports my argument”). In data-driven research, you make all of these choices out in the open. Or at least you’re supposed to. (The difference between theory and practice is always large, hello humans.)

So let’s assume for now that you will always be working with samples. This means you will need to reflect on the representativeness of your data. What does representative mean? In technical terms, it means that every member of the target population you wish to study had an equal probability of being selected. If not, then your sample is “biased” (don’t panic yet). 
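
To make “equal probability of being selected” concrete, here is a toy sketch in Python. The population list is a stand-in for whatever your documents are; the point is simply the contrast between a simple random sample, where every document has the same chance of selection, and a convenience sample that can only ever contain certain documents.

```python
import random

# Stand-in population: 85 identifiers, one per hypothetical novel.
population = [f"novel_{i:02d}" for i in range(1, 86)]

random.seed(42)  # fix the seed so the sample is reproducible

# Every novel has an equal probability of ending up in the sample.
random_sample = random.sample(population, k=14)

# A convenience sample: only the earliest novels can ever be chosen,
# so selection probabilities are not equal -- the sample is biased.
convenience_sample = population[:14]
```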

Take the case of Lancashire and Hirst. Their sample, for example, is biased towards the beginnings of novels. Not all parts of the novels had an equal chance of being selected for their data. What effect might this have on their findings? Similarly, the choice of novels was dictated by which books were digitally available. Again, their sample is biased because not all of Christie’s novels had an equal chance of being selected. What effect might digital availability have on the qualities of the novels that end up in their sample?

The important point here is that we don’t know the answers to these questions. It is possible that these issues matter greatly. It is also possible they don’t matter at all for the conclusions that the authors will eventually make. People love to use bias as a cudgel. But the issue is, as always with these things, more nuanced.

Take the example of the study of British doctors who smoked. I love this one, not only because it is about doctors who smoke. It was also the first major finding to show a relationship between smoking and lung cancer. That’s incredibly important. But the study is biased! Of course British doctors are not representative of all British citizens, let alone all humans. And yet it would be extremely unreasonable to believe that the mechanism explaining the relationship between smoking and lung cancer in doctors was not potentially transportable (i.e. generalizable) to other populations of people. It is also extremely reasonable to ask whether the association between these two things might behave somewhat differently in other types of people. In other words, the sample bias opens up knowledge about something and initiates questions that are worth further study. It can do/be both.

Salganik refers to this as the difference between within-sample comparisons and out-of-sample generalizations. Biased samples are less problematic when you make comparisons within them (comparing doctors who smoke with doctors who don’t smoke). They are potentially more problematic when you make generalizations beyond the sample (assuming all other people will behave like doctors who smoke). However, the point of the doctors example is that even biased samples can be potentially useful for out-of-sample generalizations.

What matters most in these issues is a) acknowledging the problems of representativeness in your sample and b) indicating that any gaps or biases are important avenues for future research. You don’t need to answer every potential problem before you start a project, but you do need to be aware of them and list as many as possible for other researchers to consider. Before you reach firm(ish) conclusions based on your data, you will want to test many more things to be certain they don’t matter. This is another important rule of computational modeling: you can’t do everything at once. There will always be omissions and your knowledge is always partial. The map is never equal to the territory, otherwise it ceases being a map (and also being useful; try unfolding one in a car). This is why scholarship is a collective endeavor. Other researchers, or your future self, can begin to fill in these blanks. Limitations don’t invalidate a study — they just limit the generalizability of what you have found.

So what are the potential problems that can be introduced by your selection of data? There is no single study that has dealt with this issue at length in the field of text analysis (hint, hint). Katherine Bode has written a great book about the problems of existing data sets in the field of literary studies, which I recommend for further reading.[1] It is a useful starting point for thinking about these ideas. For now, I will try to outline a few of the biases you might want to look out for when collecting your data:

– Geographic bias. Are all of your documents from the same “space” (whether national, geographic or linguistic)? If so, how might the inclusion of documents from outside of that space impact your findings? Why are you only looking at documents from a single national framework?

– Temporal bias. The most obvious form of this kind of bias is when you limit your study to a particular timespan. If you study a set of Victorian novels, your findings are obviously not generalizable to earlier or later periods (this hasn’t stopped researchers from doing so, of course, i.e. “modernity”). But there are subtler forms of temporal bias as well. For example, even if you stick to saying something about Victorian novels, there were not equal numbers of novels published across the period, i.e. production was not the same year in and year out. So if you want to represent the period, you have to choose whether you weight each year equally in your sample or whether you weight certain years more heavily depending on how many novels were produced during them. Do you want to capture time (an equal representation of each year) or production (a representation weighted by the number of novels published)? There isn’t necessarily a right answer here, but the choice does depend on your research goals (see the sketch after this list).

– Demographic bias. How well does your sample approximate the underlying demographic distributions of your target population? For example, if we know that the percentage of published women authors during a time period was 40%, then your sample should have 40% women authors in it (this gets more complicated if their representation also changed over time, i.e. if the probability of seeing a woman author increases/decreases by year, then your sample should also reflect this). Other forms of identity that we know impact forms of expression are authors’ racial, ethnic, or sexual identities as well as their class or educational level. If you want to approximate how humans write stories then you shouldn’t only study published authors (or printed books for that matter). There will be biases in that pool that do not reflect the overall demographic distributions of the human population.    

– Survivorship bias (aka Reproducibility bias). Survivorship bias is a term used to capture the situation where the data you collect reflects only units that survived some selection process rather than all of the possible observations from a target population. (See the Wikipedia example about WWII planes, it’s fascinating and clear.) Many of the documents that can be computationally studied reflect this problem, especially with respect to the past. Books that have been digitized, for example, have been subject to a variety of what we might call “stacked” selection pressures, such as popularity, canonization, indexing, technological fit (as when certain books are more easily digitized through OCR than others), and profitability. Prior to digitization, certain books were also more likely to be reproduced (i.e. reprinted), which in turn made them more likely to be digitized, although the logic for such reproduction may not be the same (in the print realm it may have been profitability and in the digital realm it may be accessibility or technological fit, etc.). A digitized book is thus the latest iteration of a series of historical mechanisms that have favored one type of book over another. Without a sample of unselected (i.e. non-surviving) books, your sample will be biased. The crucial problem is that this makes it very hard to know whether what you are finding is the result of hidden historical factors that contributed to the availability of the documents you are studying or whether the findings are intrinsic to the category you are studying (i.e. are you seeing something about Victorian “novels” or about Victorian publishers, twentieth-century libraries, twenty-first-century corporate archiving services, OCR technology, etc.).

– Algorithmic bias. Much of the data researchers wish to study is generated and protected by corporations whose mission is not to provide full access to that data. Thus when you collect data, you may be offered a selection that has been subject to algorithmic filtering you have no knowledge of. For example, Goodreads provides access to readers’ responses to books, but only for a limited number of comments that are ranked by a proprietary algorithm. Without knowledge of that algorithm you cannot know what those comments are representative of. Similarly, as discussed in the previous example, algorithmic models also mediate the reproducibility (and thus selectability) of documents. OCR error introduces another form of algorithmic bias into your data, for example, when not all of your documents behave the same under the conditions of OCR reproduction. Finally, online platforms are often organized according to certain algorithmic features that will impact people’s behavior. The most famous example is the suspiciously large number of people on Facebook who have around twenty friends. You might be tempted to come up with a theory to explain this (twenty is such a magic number!) when in fact the best explanation is that Facebook encourages you to link yourself to twenty people when you sign up and then stops. The platform’s algorithms guide certain behavioral outcomes, so when you observe behavior without knowledge of the underlying engineering priorities of a platform, you may not realize that what you are seeing is an effect of algorithmic conditioning, not human preferences.

– Platform bias. People who are present on particular digital platforms are not necessarily representative of the population as a whole. To use Goodreads again, Goodreads users are not a good demographic mirror of all human beings, or even of people within a single national context. You cannot generalize about “readers” when you are sampling “readers on Goodreads.” This holds for other digital platforms such as Twitter, Facebook, or Reddit. In some cases, a particular platform’s population is precisely what you want to know something about, such that the nature of the platform is part of the object of study. In that case, focusing on a single platform is appropriate, provided that it is possible to get a representative sample of “Goodreads” or “Twitter” users, which is also not always the case. However, it is possible that the mechanism you are studying is generalizable beyond the biases of your sample, i.e. similar to the smoking doctors study, it could be the case that this demographically skewed world can still tell you something useful about contemporary readers.

– Typographic bias (maybe also paratextual bias). This problem refers to the way texts are produced according to certain conventions that are independent of the semantic features you may care about. It can take two forms. The first is when texts are associated with certain semiotic conventions, i.e. signs that accompany texts in a particular format. In the printed realm this could be running headers, prefatory material, etc., anything referred to by Gérard Genette as “paratext.” It is totally reasonable that paratext may in fact be what you want to study (like Genette). But it can also be the case that if you are not able to differentiate between these dimensions of a text, your findings may be more related to these conventions of production than to the contents of the text. A really frequent word may be due to the author’s choices, or it may be due to the fact that it appeared on every single page and your data retains those representations. In the digital realm this might take the form of metadata attached to user posts, as on Twitter. You wouldn’t want to get excited about the frequency of @ or # signs on Twitter. Think about whether there are semiotic conventions surrounding your data that you want to exclude from consideration for your research question (see the sketch after this list). The second type of typographic bias is when a text’s semiotic conventions do not align well with reproduction technologies. The classic example is OCR. Certain texts perform much more poorly under OCR than others. OCR itself may be your object of study, in which case this is fine. But if you want to study the underlying texts, you need to make sure that your results aren’t due to some of your texts aligning better with reproduction technologies than others. This problem is closely related to algorithmic and platform bias.

– Reception v. Production bias. Katherine Bode has argued that when we look at collections based around dates of publication, we miss out on the actual historical circulation of books that readers would have had access to.[1] This is an extremely important and complex point. When you are selecting data, is your aim to represent a population of books “published” or books “read”? This might determine, for example, whether you weight your sample of books more heavily according to their known print runs or whether you weight your sample evenly over time to capture the distribution of items regardless of how many were bought and sold. In the first scenario, your sample would have more books that sold more copies because you want to approximate what readers were buying. For example, if you want to create a sample of “popular” books, then bestseller lists can be a very useful tool. On the other hand, if you want to approximate what was being produced by a particular industry, then you will want to sample in a way where all books from a given list have an equal chance of being sampled, because you want to approximate what publishers were publishing. In neither of these cases are you sampling based on what people actually “read.” In scenario 1, you are using “printing” as a proxy for “reading,” which of course are not the same thing, and in scenario 2 you are using “publishing” as a proxy for reading. Lots of books are printed which are a) never bought and b) never read. Capturing “reading experience” is thus an extremely challenging problem and requires data beyond your primary documents (i.e. texts produced by readers or reader responses measured through laboratory experiments).
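
Since several of the entries above (temporal bias, demographic bias, reception v. production) come down to how you weight your sample, here is the small sketch promised under temporal bias. It contrasts the two choices: sampling so that every year counts equally (“time”) versus sampling so that years with more titles count more (“production”). The catalogue and its yearly distribution are invented purely for illustration.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical catalogue: (title, year) pairs with uneven yearly production.
catalogue = [(f"novel_{i:04d}", random.choice(range(1837, 1902))) for i in range(5000)]

by_year = defaultdict(list)
for title, year in catalogue:
    by_year[year].append(title)

def sample_by_time(n_per_year=2):
    """Weight each year equally, regardless of how many titles appeared in it."""
    return [title
            for titles in by_year.values()
            for title in random.sample(titles, min(n_per_year, len(titles)))]

def sample_by_production(n_total=200):
    """Weight by output: every title is equally likely, so prolific years dominate."""
    return random.sample([title for title, _ in catalogue], n_total)
```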
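
And, as promised in the typographic bias entry, here is a sketch of stripping paratextual noise before counting word frequencies. The running-header pattern and the Twitter-style markup are assumptions about one imaginary corpus, not a general-purpose cleaner; the point is only that such conventions need to be removed (or deliberately kept) before you interpret frequency counts.

```python
import re
from collections import Counter

# Illustrative patterns: a running header repeated at the top of every page,
# and Twitter-style @mentions and #hashtags attached to user posts.
RUNNING_HEADER = re.compile(r"^THE MYSTERIOUS AFFAIR AT STYLES\s*\d*$", re.MULTILINE)
TWITTER_MARKUP = re.compile(r"[@#]\w+")

def clean(text: str) -> str:
    text = RUNNING_HEADER.sub("", text)  # drop headers introduced by page layout
    text = TWITTER_MARKUP.sub("", text)  # drop platform metadata tokens
    return text

def word_counts(text: str) -> Counter:
    """Count words only after typographic/paratextual conventions are removed."""
    return Counter(re.findall(r"[a-z']+", clean(text).lower()))
```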

You can see just how hard it is to create an “unbiased” sample. To do so, you need accurate demographic information about the population you wish to study, something we very often do not have in the humanities, especially when studying the past. But you also have to assume that everything else you will do over the course of building your model will not introduce “bias” into your findings. Which is of course impossible. It is for this reason that I recommend thinking about this process as a form of representation — how have you represented your population and what are the limitations of doing so? This will allow researchers to prioritize further tests to see where the distortions of your model lie and what effects those distortions might have. Rather than posit some pristine state of neutrality that you will never reach, it is better to focus on all of the ways you are introducing perspective into your knowledge and let others see how distorting your perspective is.


[1] Katherine Bode, Digital Collections and the Future of Literary History (Ann Arbor: University of Michigan Press, 2018).

2 thoughts on “Selecting Data”

  1. First instance of the «OCR» abbreviation in the text: «more easily digitized through OCR than others)». For a layman like me it would be cool to know what it stands for. I inferred it is a digitization method of some kind… Anyway, thanks for all that information, it enables us to put perspective on the data and be more knowledgeable about the limits of our future conclusions.


      1. Thanks Joseph! Yes, OCR refers to “optical character recognition,” which is a way of digitizing printed texts to make them available for computational text analysis. Since it’s not a perfect tool, it introduces a lot of potential error into the documents depending on how old the original text is (older type is more irregular and therefore harder to OCR).

      Good to know what terms need elaborating.


