Key Concepts

Mathematical structures for literary study

First, a few definitions. Literary mathematics is the practice of representing critical concepts using formal expressions that describe relations among literature’s countable features. In this regard, literary mathematics provides the theoretical foundation for quantitative literary analysis. These two activities are very closely related and, at first, might even seem to be the same thing. However, I have found it useful to maintain this distinction as an informing heuristic:

When we interpret texts based on their quantified features, I will say we’re doing quantitative literary analysis. When we’re trying to explain more generally what aspects of textuality can be quantified and, once quantified, interpreted in what kinds of ways, I will say we’re doing literary mathematics. Thus, literary mathematics is the theory of quantitative literary analysis.

Is quantitative literary analysis limited to the study of "literature"?

No. The term "literary" does not refer to any particular genres of texts (poems, novels, &c.) nor to any particular mode (imaginative, creative, fictional, or whatever). It refers only very generally to any written discourse used to communicate meanings between persons — and so would exclude, perhaps, the study of some kinds of texts, like computer code. But it does involve the use of textual data to study history, politics, and culture, and so I use "quantitative literary analysis" interchangeably with phrases like "cultural analytics," "corpus-based inquiry," and "distant reading." These terms all signify – to me – the practice of using corpus data to describe and explain some aspect of human society, the record of which I refer to as "literature."

What does it mean to say literary mathematics is a "theory"?

Literary mathematics is not theoretical in the sense of a scientific theory, based on testing hypotheses (although it might involve testing hypotheses). Nor is it theoretical in the sense implied by the phrase literary theory, which suggests a body of authoritative knowledge available to be used for interpreting textual evidence (although it is used to interpret textual evidence). Literary math offers no causal explanation of natural phenomena. Nor does it provide any framework for explaining how literary interpretations might be generalized to larger social concerns. Rather, it provides a general set of principles for studying literature quantitatively.

These principles can be divided into two kinds, based on the sorts of questions they help answer.

  1. Ontological. What kinds of textual phenomena exist? To what countable features are they analogous?
  2. Analytical. How can one group of features be compared to another? What methods of comparison are appropriate? How should differences among textual objects be interpreted?

Quantitative literary analysis requires thinking hard about what it means to count textual things. If you’re going to convert a sequence of sonnets into some quantitative format, you’ll need to make a lot of decisions about what needs to be counted, and you’ll need to be able to explain how you’ve counted them and why you made the choices you’ve made. You’ll need a topology of the text, which includes both a body of textual data, called a corpus, and a list of categories for organizing that data, as well as a set of rules for how those categories relate to one another. You’ll also need an analytical framework that explains how measurements of literary data, expressed under some given topology, reveal significant differences across that topology’s categories. The purpose of this website is to provide a theoretical framework for quantitative literary analysis to help students and aspiring researchers perform this work with confidence and to describe it with clarity.

Why use quantitative methods?

Quantitative literary analysis can support textual interpretation at many levels, and it becomes more useful as the goals of analysis become more sophisticated.

  • Reading comprehension and paraphrase: Measurements can provide a general thematic summary of the text’s main topics and can evaluate the extent to which any given quotation is typical or atypical of the text.

  • Close reading: Measurements can provide detailed profiles of individual words and syntactical patterns, exposing connections within a single literary text and among multiple texts. At this level, quantification often identifies statistical anomalies that reflect a text’s latent assumptions or a word’s latent associations, and is therefore especially valuable for all aspects of intellectual history.

  • Historical contextualization: Depending on the textual contents of the corpus, measurements allow confident generalizations about how any historical document compares with others by the same author, of the same genre, or from the same period. Similarly, they enable thick descriptions of the various uses of individual words. At this level, quantitative analysis is extremely useful for describing how any text is situated within a larger historical field of similar documents or how any word contributes to a larger semantic field.

Notice, however, that none of the uses for quantitative literary analysis described above relate directly to literary theory, much of which hopes to explain how literature contributes to our socially experienced and ideologically constructed reality. Quantitative analysis gathers new evidence and creates new objects of knowledge, and literary mathematics explains how those textual objects exist in relation to one another.

What is a corpus?

The word "corpus" generally refers to a collection of documents stored digitally, usually as files recorded as plain text or marked up with XML. Every unique word in a corpus is called a type or word form, and every instance of each word is called a token. The words "dogs" and "cats" are two different types, but if the word "dogs" appear twice in a book, each instance is a token of the type.

Before continuing, it is worth pausing over the question of a token. What separates the type "dog" from the instance of the type "dog" is the fact that it "appears in" or "is used in" a given stretch of discourse. The token exists because it can be characterized as having both a kind and a location. For this reason, corpora are conventionally described along two axes of differentiation: syntagmatic and paradigmatic. If tokens appear together in the same document, paragraph, or sentence, they are syntagmatically related. If they are of the same type, they are related paradigmatically. Tokens as such can be identified and therefore counted only when these two kinds of relations can be observed simultaneously, when this kind of word (unlike or in addition to those words) appears in that document (instead of or in addition to those documents).

The fun thing about tokens is that they can be counted in lots of ways. If you count the number of tokens that exist in a document, you’re measuring that document’s length. If you count the number of tokens that share a common type, you’re measuring the frequency of that word. Both kinds of counting imply axes of difference that cut across the corpus in different directions.
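These two ways of counting can be sketched in a few lines of Python. The sentence below is an invented toy example, not drawn from any corpus discussed here:

```python
from collections import Counter

# A toy document: each word occurrence is a token; each unique form is a type.
tokens = "the dog saw the other dog".split()

doc_length = len(tokens)       # counting all tokens measures the document's length
frequencies = Counter(tokens)  # grouping tokens by type measures word frequencies

print(doc_length)          # 6 tokens
print(len(frequencies))    # 4 types: "the", "dog", "saw", "other"
print(frequencies["dog"])  # the type "dog" has 2 tokens
```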

Figure 1. A very simple literary topology

In this simple visualization, we imagine a very small corpus described under a very simple literary topology — just four documents and seven word types. All tokens in the corpus can be identified as belonging to, being a member of, or being contained in any of these eleven simple subsets. This system of subsets describes the structure of the corpus — its topology.

One important thing to notice is that some kinds of subsets overlap and others don’t. No token can be of two different types. No token can be found in two different documents. But all tokens in every corpus have both a document location and a word form. This is what I mean when I say that words are counted at the intersections of kind and location. The number of tokens that exist at each intersecting subset (of which there are twenty-eight) represents the frequency of the word in that document. The set of all such frequencies for a given word is called the distribution of that word across documents. The set of all such frequencies for a given document is the distribution of words in that document.

Most applications of quantitative literary analysis depend on structures like these. They describe words over a fixed bibliography of documents and describe documents over a fixed vocabulary of word types. Once these mutually defining distributions are recognized, it becomes possible to compare and contrast them quantitatively. The statistical tools for making such comparisons can get extremely complicated. However, underneath all the fancy applications of computational linguistics and machine learning lies this simple basic structure. Luckily, simple analytical procedures performed over structures like these are usually adequate for corpus-based historical inquiry.

Matrices — what they are and why you need to know them

You may have noticed that figure 1 above looks a lot like a spreadsheet, with words arranged as rows and documents as columns. Once we count the number of tokens contained in each intersecting subset, we can gather those numbers into a spreadsheet-like structure called a matrix.

Let’s flesh out the example imagined in figure 1. Suppose each of our four documents is just a single sentence long.

  • Doc A: The travel agent will book your flights to Chicago and Los Angeles, but you’ll need to book your own ticket to Phoenix.
  • Doc B: The agent worried that the book was written in a style with too many flights of fancy.
  • Doc C: Let’s travel in high style to all the fancy travel destinations in Europe.
  • Doc D: The FBI agent considered her suspect’s written confession carefully.

If we decided to count only words that appear in more than one document and that were at least four characters long (thus removing words like "the," "in," and "to") we’d have a fixed vocabulary of seven word types: agent, book, fancy, flights, style, travel, and written. The frequencies of each word in each document can then be represented in tabular format:

          Doc A  Doc B  Doc C  Doc D
agent       1      1      0      1
book        2      1      0      0
fancy       0      1      1      0
flights     1      1      0      0
style       0      1      1      0
travel      1      0      2      0
written     0      1      0      1
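The table above can be reproduced with a short Python sketch. The crude tokenizer below (lowercasing and stripping surrounding punctuation) is my own assumption for illustration, not a prescribed method:

```python
from collections import Counter

docs = {
    "Doc A": "The travel agent will book your flights to Chicago and Los "
             "Angeles, but you'll need to book your own ticket to Phoenix.",
    "Doc B": "The agent worried that the book was written in a style with "
             "too many flights of fancy.",
    "Doc C": "Let's travel in high style to all the fancy travel "
             "destinations in Europe.",
    "Doc D": "The FBI agent considered her suspect's written confession carefully.",
}

vocabulary = ["agent", "book", "fancy", "flights", "style", "travel", "written"]

def tokenize(text):
    # Deliberately crude: lowercase each word and strip surrounding punctuation.
    return [word.strip(".,'").lower() for word in text.split()]

counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Each row is a word's distribution across the four documents.
for word in vocabulary:
    print(word, [counts[name][word] for name in docs])
# agent [1, 1, 0, 1]
# book [2, 1, 0, 0]
# ... and so on, matching the table above
```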

A table as small as this can be scanned visually without too much trouble. If you look at the rows for fancy and style, you’ll see that they both appear in documents B and C and nowhere else, suggesting they’re very closely related. Comparing columns, we can see that document B shares a lot in common with both A and C, but that what it shares is different in both cases. Document D is a weird outlier, sharing little with the rest of the corpus.

However, as you can probably already tell, visually scanning the data in this way quickly becomes unwieldy. If we had a few more documents or included just a few more word types in our model, visual comparisons would be extremely difficult to parse — to say nothing of actual corpus data which often measures the frequencies of thousands of words over thousands of documents. For this reason, it’s useful and in almost all cases necessary to represent such tables mathematically. We can write this table as a matrix, V, with 7 rows and 4 columns:

V = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 2 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 2 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}

Once your corpus is represented in this format, it becomes possible to describe both words and documents as vectors — that is, as sequences of numbers or, to use the language from above, as distributions over fixed variables. For example, we can describe agent using a four-item sequence {1, 1, 0, 1} to represent its distribution across four documents, and Document C can be described as a seven-item sequence {0, 0, 1, 0, 1, 2, 0}, representing its distribution over seven word types.
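Storing V as a nested list in Python (an illustrative sketch of the same structure, not part of the original example), both kinds of vectors fall out by simple indexing:

```python
# Rows are word types; columns are documents A-D, as in the matrix V.
vocabulary = ["agent", "book", "fancy", "flights", "style", "travel", "written"]
V = [
    [1, 1, 0, 1],  # agent
    [2, 1, 0, 0],  # book
    [0, 1, 1, 0],  # fancy
    [1, 1, 0, 0],  # flights
    [0, 1, 1, 0],  # style
    [1, 0, 2, 0],  # travel
    [0, 1, 0, 1],  # written
]

agent = V[vocabulary.index("agent")]  # a word vector: distribution over documents
doc_c = [row[2] for row in V]         # a document vector: distribution over words

print(agent)  # [1, 1, 0, 1]
print(doc_c)  # [0, 0, 1, 0, 1, 2, 0]
```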

Matrices are used in a wide range of studies, including but not limited to word-collocation data like these. Network analyses are often visualized with pretty-looking graphs that use points and lines to show clusters of data, but those clusters are inferred from matrices that sit under the hood. Geospatial analyses are often drawn on beautiful digital maps, but the information used to make such maps is usually stored in "attribute tables," which share a similar structure. When geographical patterns are described and mapped, they’re usually measured using matrices of geolocated data.

Quantifying qualitative difference: or, What is the dot product?

Virtually all forms of quantitative literary analysis use statistical descriptions of matrices to show variation across a corpus. To return to the example above, we can say that fancy is different from style, but that they share a common distribution. Both are represented by the vector {0, 1, 1, 0}. We should be able to specify further that some words have less in common, and we should be able to characterize how similar or different any two documents might be. The purpose of vectorization is to make such comparisons possible by measuring each word or document over fixed distributions. Rather than think of Document A as a collection of tokens, we think of it as a structured set of intersecting subsets of tokens (to refer to figure 1 above). Because documents B, C, and D are described over the same structure of subsets, statistical comparison becomes possible.

The basic procedure for performing such comparisons is to take any two vectors and combine their respective elements, usually by multiplying them pairwise and then adding the products together. The simplest and most common such procedure is known as the dot product, which, for any two length-n vectors, a and b, looks like this:

a \cdot b = a_1b_1 + a_2b_2 + \cdots + a_nb_n

The dot product is a foundational concept because it shows how to reduce two sequences of numbers to a single value that describes their similarity — that is, the extent to which their constituent values overlap with each other. Within the matrix V above, we can say that the dot product between fancy and style is 2, because (0 x 0) + (1 x 1) + (1 x 1) + (0 x 0) = 2, while the dot product between travel and written is 0, because (1 x 0) + (0 x 1) + (2 x 0) + (0 x 1) = 0.
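In Python, this calculation is a one-line reduction. A minimal sketch, using the word vectors from the matrix V above:

```python
def dot(a, b):
    # Multiply the two vectors element by element, then sum the products.
    assert len(a) == len(b), "the dot product requires equal-length vectors"
    return sum(x * y for x, y in zip(a, b))

fancy   = [0, 1, 1, 0]
style   = [0, 1, 1, 0]
travel  = [1, 0, 2, 0]
written = [0, 1, 0, 1]

print(dot(fancy, style))     # 2 -- overlapping distributions
print(dot(travel, written))  # 0 -- no overlap at all
```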

Usually, scholars will use slightly more complicated versions of this calculation, normalizing the values in one way or another rather than using raw word counts. Examples of such calculations include cosine similarity, Kullback-Leibler divergence, and covariance.
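As one illustration, cosine similarity divides the dot product by the two vectors' magnitudes, removing the effect of raw length. This is a sketch of that one normalization only, not a treatment of the others named above:

```python
import math

def cosine_similarity(a, b):
    # The dot product normalized by the vectors' Euclidean lengths; for
    # non-negative count vectors the result falls between 0 and 1.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

fancy = [0, 1, 1, 0]
style = [0, 1, 1, 0]
agent = [1, 1, 0, 1]

print(round(cosine_similarity(fancy, style), 6))  # 1.0 -- identical distributions
print(round(cosine_similarity(fancy, agent), 3))  # 0.408 -- partial overlap
```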

Further topics

Check here for future updates, which will include tutorials for additional concepts important to corpus-based cultural analysis.