Feature Space

So far you have learned how to ingest a lot of documents and normalize them, both textually and mathematically. The next step is to begin to think about how you are going to represent your documents through variables, which we will call “features.” A “feature” is any quality that belongs to a document that you choose to measure quantitatively. A feature space is the sum of features you choose to use in your analysis.

In your case, the primary feature you have been using so far are words. This is a reasonable choice given that one of the primary components of things called documents are words! HOWEVER, and yes I couldn’t resist the all caps, words are not the only features that belong to documents. In this section you will be learning some popular types of features that can be used in text analysis (you already saw one example, type-token ratio, above). The important point is that just about anything can be a feature. What makes documents so interesting is that they are multi-faceted (or multi-dimensional to keep our spatial metaphors going). So your question needs to be, What dimensions of documents am I interested in studying, and also, what dimensions am I leaving out?

When you think of your features as a feature space it helps you conceptualize how you are representing your data. Remember above from the section on data selection. When you collect data you are creating a representation of some larger social category. You’re doing that here too when you construct your feature space. You are saying that you believe an appropriate way to represent your documents is through such and such features. Thus, you are creating a representation of a representation. There is no such thing as a natural set of features to study documents. Everything is constructed!

The key is to reflect on how your choice of features might impact your results and how a different set of features would or would not make a difference. If some outcome is always the same independent of what features you use then you know it is a very, very robust finding. If something is very sensitive to the process of feature selection then you know it is conditional on those choices. This is no less valuable, but it is important that you know the difference.

The best part about feature construction is that it is also one of the most creative parts of the job. What’s a feature? Whatever you can dream up. For starters, let’s begin with words. Ok, a bit vanilla, but it’s always good to start with the simplest option. This is actually a great mantra for data science in general:

Always start with the simplest option!

Even when you’re just using words as features, there are some very important choices to make. I.e. even the simple isn’t simple (another good mantra).

Always start with the simplest option!

Share this:

Leave a comment Cancel reply