Networks and Bibliographical Metadata

The code below offers a walkthrough of Chapter 1: Networks and the Study of Bibliographical Metadata, showing how to replicate all data visualizations in that chapter. If you have not yet done so, be sure to install the litmath and litmathdata R packages, which include the necessary functions and data. Both packages can be installed from GitHub by opening RStudio and entering the following commands in the console.

devtools::install_github("michaelgavin/litmath@main")
devtools::install_github("michaelgavin/litmathdata@main")

Activate libraries and import data

This chapter makes use of the igraph R package, so you’ll need to install and activate that package, along with litmath and litmathdata.

library(igraph)
library(litmath)
library(litmathdata)

Basic graph of the whole

All data for the chapter comes from a co-publication network of authors, printers, and booksellers, drawn from the Phase I release of the Early English Books Online corpus, published by the Text Creation Partnership (EEBO-TCP). The data is stored as an igraph data object, called eebo_network in the litmathdata package. The chapter begins with a simple visualization of the entire graph (or, more accurately, a subset of the whole) that considers only people (nodes) who have more than twenty-five connections. This filters out low-frequency nodes, which would otherwise clutter the visualization. Here’s the code for reproducing figure 1.2.

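data("eebo_network")
g = eebo_network
# Keep only nodes with more than 25 connections
subg = induced_subgraph(g, which(degree(g) > 25))
V(subg)$name = "" # blank out names so no labels are drawn
pg = simplify(subg) # drop duplicate edges and loops before plotting
plot(pg, 
     layout = layout_with_drl(pg),
     vertex.color = V(subg)$clust, 
     vertex.size = log(degree(subg)))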

You’ll notice that each node is assigned a color. Those were based on a community-detection algorithm that was run during analysis, the outcome of which is stored directly in the graph data. As I discuss in the chapter, the nodes of this graph sort loosely chronologically. People are most likely to be connected in the graph if they lived around the same time. The central nodes (light blue and green) represent members of the book trade who tended to be most active during the middle of the seventeenth century.
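One detail of the filtering step is worth making concrete: once a subgraph is taken, degree() is recalculated within it, so node sizes in the plot reflect connections among the retained nodes only. Here is a minimal sketch on an invented five-node graph. The node names (hub, a, b, c, d) and the threshold of 1 are illustrative assumptions, not part of the EEBO data.

```r
library(igraph)

# Invented toy graph: a hub with four neighbors, plus one edge between a and b
toy = graph_from_literal(hub-a, hub-b, hub-c, hub-d, a-b)

# Degrees in the full toy graph: hub = 4, a = 2, b = 2, c = 1, d = 1
deg_full = degree(toy)

# Keep only nodes with more than one connection (hub, a, b)
toy_sub = induced_subgraph(toy, which(deg_full > 1))

# Degrees are recounted inside the subgraph, so the hub drops from 4 to 2
deg_sub = degree(toy_sub)
```

The same thing happens, at a much larger scale, when the EEBO graph is cut down to its high-degree nodes.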

Next is the code for replicating Table 1.1, which compares top nodes by degree and betweenness. Using the igraph package, as you can see, the commands themselves are very simple.

d = degree(g)
b = betweenness(g)

# Top by degree
sort(d, decreasing = TRUE)[1:10]

# Top by betweenness / degree
hits = which(d > 10) # To remove minor figures & outliers
sort(b[hits]/d[hits], decreasing = TRUE)[1:10]
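To see why dividing betweenness by degree picks out a different kind of node than degree alone, it may help to try the same commands on a small invented graph. The sketch below assumes two triangles joined through a single bridging node called x; all node names are made up for illustration.

```r
library(igraph)

# Two triangles (a-b-c and d-e-f) connected only through the bridge node x
toy = graph_from_literal(a-b, b-c, a-c, c-x, x-d, d-e, e-f, d-f)

deg = degree(toy)       # x has only 2 connections; c and d have 3 each
btw = betweenness(toy)  # but every left-to-right shortest path runs through x

# Betweenness relative to degree: highest for the bridge node
ratio = btw / deg
top_ratio = names(which.max(ratio))
```

Here x has the lowest degree among the connecting nodes but the highest ratio, which is the sort of broker position the second ranking is meant to surface.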

Analyzing the degree distribution

The degree is a measurement that indicates how many connections each node has. The degree distribution says how many nodes there are for each degree — how many people have just one connection, how many have two or three, and how many have hundreds. This is easy to find in R by just using the table() function to sort degree into groups.

Following conventions in network science, I refer to the degrees as k and the degree distribution as p(k) or, more simply, pk.

This is Figure 1.4.

pk = table(d)
k = as.numeric(names(pk))
pk = as.numeric(pk)

plot(k, pk)

Personally, I still find degree-distribution plots to be somewhat confusing. The x-axis shows the degrees of the various nodes — the most central figure has almost 800 connections, and he’s located in the bottom right. The y-axis shows how many people have that number of connections. In the upper left, you can see that thousands of people in the EEBO metadata have just one or two connections.
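If the two axes are still hard to keep straight, a tiny invented example may help. The degree vector below is made up for illustration (eight people, most with one connection) and uses only base R.

```r
# Invented degrees for eight people
d_toy = c(1, 1, 1, 1, 2, 2, 3, 5)

# table() counts how many people share each degree
pk_toy = table(d_toy)
k_toy = as.numeric(names(pk_toy))  # the distinct degrees: 1, 2, 3, 5
pk_toy = as.numeric(pk_toy)        # how many people have each: 4, 2, 1, 1

plot(k_toy, pk_toy)
```

The point in the upper left is the four people with one connection; the point in the bottom right is the single person with five.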

Precisely because power-law distributions are hard to read and all look the same, they’re often visualized by taking the logarithms of k and p(k). This is visualized in figure 1.5.

plot(log(k), log(pk))
hits = which(log(k) > 1 & log(k) < 5)
abline(lm(log(pk)[hits] ~ log(k)[hits]))
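The line drawn by abline() has a slope, and on a log-log plot that slope is a rough estimate of the power-law exponent. The sketch below shows how to read it off, using synthetic data generated from an exact power law with exponent -2 (an assumption chosen for illustration, not a value from the chapter). A least-squares fit on logged values is only a quick approximation, so treat the number as descriptive.

```r
# Synthetic degree distribution following pk = 1000 * k^(-2) exactly
k_syn = 1:50
pk_syn = 1000 * k_syn^(-2)

# Fit a straight line to the logged values, as in the plot above
fit = lm(log(pk_syn) ~ log(k_syn))

# The second coefficient is the slope of the line
slope = unname(coef(fit)[2])
```

With the real EEBO degrees, the same coef() call on the fitted model would report the slope of the line in figure 1.5.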

Network statistics by book trade role

A key feature of the early modern book trade is the designation of individuals into roles that are fairly distinct. Most participants in the trade are either authors, booksellers, or printers.

Figure 1.6 (left)

role_counts = table(V(g)$role)
barplot(role_counts[c("author","bookseller","printer")])

Figure 1.6 (right) shows boxplots of degree by role.

hits = which(V(g)$role %in% c("author","bookseller","printer"))
boxplot(d[hits] ~ V(g)$role[hits], outline = FALSE)

There are, of course, other kinds of roles in the EEBO metadata — engravers, dedicatees, etc. But author, bookseller, and printer are by far the most common in the data.

In figure 1.6 (right), the boxplots have outliers removed. That’s the outline = FALSE argument in the boxplot() function. To create figure 1.7, just run the same command while leaving boxplot()’s default setting, which includes outliers.

boxplot(d[hits] ~ V(g)$role[hits])

The output of your R graphics won’t look much like figure 1.7. I deleted the boxes themselves and manually added labels for people I thought readers might find interesting.

Measuring change over time

All the measurements for tracing change over time in figures 1.8 to 1.11 are generated by taking subgraphs of the EEBO network over a trailing 10-year period. To see how the network was queried — and also just to examine a sample for loop for data curation — the code is reproduced below.

#### Change over time ####
# Set empty variables
YEAR = c()
TITLES = c()
NODES = c()
EDGES = c()
CENTRALIZATION = c()

# Begin for loop
for (i in 1500:1699) {
  # Create a subgraph for each trailing 10-year window
  start = i - 9
  end = i
  hits = which(E(g)$date >= start & E(g)$date <= end) # selects edges within the time window
  subg = subgraph.edges(g, hits)
  
  # Gather basic data for each year
  YEAR = c(YEAR, end)
  TITLES = c(TITLES,length(unique(E(subg)$tcp)))
  NODES = c(NODES, vcount(subg))
  EDGES = c(EDGES, ecount(subg))
  CENTRALIZATION = c(CENTRALIZATION, centr_eigen(subg)$centralization)
}
network_data = data.frame(YEAR, TITLES, NODES, EDGES, CENTRALIZATION)
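The trailing-window logic is easier to check on a handful of numbers than on the full network. The sketch below uses invented edge dates and base R only; it counts how many dates fall in each ten-year window ending at a given year, which is the same test the loop applies to E(g)$date.

```r
# Invented edge dates, for illustration only
dates = c(1501, 1505, 1510, 1512, 1520)
years = 1510:1520

# For each year, count dates from nine years earlier through that year
edges_in_window = sapply(years, function(y) {
  sum(dates >= y - 9 & dates <= y)
})

window_counts = data.frame(YEAR = years, EDGES = edges_in_window)
```

For 1510 the window runs from 1501 to 1510 and catches three dates; by 1511 the date 1501 has dropped out and the count falls to two.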

Once the data has been collected, creating the time-series graphs is simple.

# Figures 1.8, 1.9, 1.10, and 1.11
attach(network_data) # this is a way to simplify R's syntax when working with data frames
plot(YEAR, TITLES, type="l")
plot(YEAR, NODES, type="l")
plot(YEAR, EDGES, type="l")
plot(YEAR, EDGES / NODES, type = "l")
hits = which(CENTRALIZATION > .99)
abline(v = YEAR[hits])

The final visualization of this section (fig. 1.12) is a network graph that shows only relations among printers.

# Keep printers only
hits = which(V(g)$role == "printer")
subg = induced_subgraph(g, hits)
# Then keep printers with more than 5 connections to other printers
hits = which(degree(subg) > 5)
subg = induced_subgraph(subg, hits)
V(subg)$name = ""
V(subg)$color = V(subg)$clust
V(subg)$size = log(degree(subg)) + 2
plot(simplify(subg))

Historical Periodization

One of the more interesting findings of the chapter, I think, involves the way community-detection algorithms sort bibliographical metadata into groups that are loosely chronological. The ‘looseness’ is what interests me, because it suggests that the model is picking up on aspects of real lived experience — people tend to be together in the network if they lived at the same time and might even have known each other — while also accounting for the ways that textuality traverses time and space, such that Erasmus and Virgil can be thought of as participants in the London book trade, where Shakespeare was as intimately connected to his Restoration-era adapters and commentators as to his meat-space contemporaries.

The communities discussed in the rest of the chapter were generated using an extremely simple walktrap.community() algorithm that is part of the igraph package (the same function is named cluster_walktrap() in current igraph releases), using its default settings. You are free to generate the communities yourself by running the following command:

wt = walktrap.community(g)

However, because there’s some randomization built in, there’s a very good chance that the communities won’t exactly match what’s in the book. To make things easier, I just saved the communities I detected into the eebo_network data itself.

I could have / should have done more robustness testing on the communities identified in this chapter. That is to say, I should have tested various community-detection algorithms systematically to compare results and to ensure that the conclusions reached in the chapter are not an artifact of the walktrap.community() function’s default settings. I’m confident the argument of the chapter will hold up well under such scrutiny, but more importantly it would be fun and fascinating to see how various algorithms offer different perspectives of the network, like diachronic snapshots of the whole.
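For anyone who wants to try that kind of comparison, here is one possible starting point. It is only a sketch, run on the small Zachary karate-club graph that ships with igraph rather than on the EEBO network, and the choice of cluster_fast_greedy() as the second algorithm and normalized mutual information as the similarity score are my assumptions for illustration.

```r
library(igraph)

# A small built-in example graph, standing in for the EEBO network
karate = make_graph("Zachary")

# Two different community-detection algorithms, default settings
wt = cluster_walktrap(karate)
fg = cluster_fast_greedy(karate)

# Normalized mutual information: 1 means identical partitions, 0 means unrelated
nmi = compare(wt, fg, method = "nmi")

# Cross-tabulate memberships to see where the two partitions disagree
agreement = table(walktrap = wt$membership, fast_greedy = fg$membership)
```

On eebo_network, the same compare() call could be run between a fresh partition and the stored V(g)$clust values.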

In any case, below is the code for finding the top authors and stationers for each period-community, as listed in table 1.2.

# Limit to seven largest communities
comms = 1:7

# Table 1.2
d = degree(g)
for (i in seq_along(comms)) {
  comm = comms[i]
  # Ten highest-degree authors in this community
  hits = which(V(g)$role == "author" & V(g)$clust == comm)
  authors = sort(d[hits], decreasing = TRUE)[1:10]
  # Ten highest-degree stationers (everyone who is not an author)
  hits = which(V(g)$role != "author" & V(g)$clust == comm)
  stationers = sort(d[hits], decreasing = TRUE)[1:10]
  if (i == 1) {
    df = data.frame(names(authors), 
                    authors, 
                    names(stationers),
                    stationers)
  } else {
    df = cbind(df, data.frame(names(authors), 
                              authors, 
                              names(stationers),
                              stationers))
  }
}

Figure 1.13 displays the local communities of Ben Jonson. This involves finding his community from the whole, then ‘zooming’ down two more levels and limiting the field to authors only. This has no real meaning, but it’s a way of filtering large networks so you can tease out a comprehensible and visualizable portion of the whole.

# Find Jonson's community and create the subgraph
subg = induced_subgraph(g, V(g)[V(g)$clust == 7])

# Then, within that subgraph, perform another community detection
wt = cluster_walktrap(subg, steps = 5)

# Limit, again, to the community Jonson is in
comm = wt$membership[which(wt$names == "Jonson, Ben, 1573?")]
subg = induced_subgraph(subg, V(subg)[wt$membership == comm])

# Limit to authors only
subg = induced_subgraph(subg, V(subg)[V(subg)$role == "author"])

# But we still have 280 nodes -- so keep only the connected
# component that contains Jonson
comms = components(subg)
comm = comms$membership[which(V(subg)$name == "Jonson, Ben, 1573?")]
subg = induced_subgraph(subg, V(subg)[comms$membership == comm])

# Now prepare the plot itself
V(subg)$size = 2 + log(degree(subg))
pg = simplify(subg)
V(pg)$label = V(pg)$name
plot(pg)

Then, for figure 1.14, do the same thing, but with Shakespeare.

# Begin by selecting community of Shakespeare (clust == 1)
subg = induced_subgraph(g, V(g)[V(g)$clust == 1])

# Then, within that subgraph, perform another community detection
wt = cluster_walktrap(subg, steps = 5)

# Limit, again, to the community Shakespeare is in
comm = wt$membership[which(wt$names == "Shakespeare, William, 1564-1616.")]
subg = induced_subgraph(subg, V(subg)[wt$membership == comm])

# And limit to authors only
subg = induced_subgraph(subg, V(subg)[V(subg)$role == "author"])

# But we still have 250 nodes -- so keep only the connected
# component that contains Shakespeare
comms = components(subg)
comm = comms$membership[which(V(subg)$name == "Shakespeare, William, 1564-1616.")]
subg = induced_subgraph(subg, V(subg)[comms$membership == comm])

# But there are a lot more authors around Shakespeare in the
# Restoration, so there's one last round of filtering we
# didn't need for Jonson
d = degree(subg)
subg = induced_subgraph(subg, V(subg)[d > 9])
V(subg)$size = 2 + log(degree(subg))
pg = simplify(subg)
V(pg)$label = V(pg)$name
plot(pg)
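Because the Jonson and Shakespeare examples repeat the same steps, the zooming could be wrapped in a small helper. The sketch below is one way to do that; the function name zoom_to() is hypothetical (it is not part of litmath or igraph), it skips the authors-only filter, and it is demonstrated on igraph's built-in karate-club graph with invented node names rather than on the EEBO data.

```r
library(igraph)

# Hypothetical helper: zoom from a graph down to the neighborhood of one node
zoom_to = function(graph, focus, steps = 5) {
  # Detect communities, then keep the one containing the focus node
  wt = cluster_walktrap(graph, steps = steps)
  comm = wt$membership[which(V(graph)$name == focus)]
  sub = induced_subgraph(graph, which(wt$membership == comm))

  # Within that community, keep only the connected component of the focus node
  comps = components(sub)
  keep = comps$membership[which(V(sub)$name == focus)]
  induced_subgraph(sub, which(comps$membership == keep))
}

# Demonstration on a built-in graph with made-up names v1, v2, ...
karate = make_graph("Zachary")
V(karate)$name = paste0("v", seq_len(vcount(karate)))
local_graph = zoom_to(karate, "v1")
```

With the EEBO data, the first argument would be one of the period-community subgraphs and the second an author's name exactly as it appears in V(g)$name.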

To compose the final data visualization, figure 1.15, use the code below.

clusts = V(g)$clust
comms = 1:7

for (i in comms) {
  print(i)
  comm = comms[i]
  subg = induced_subgraph(g, which(clusts == comm))
  dates = E(subg)$date
  label = round(mean(dates), digits = 0) # mean publication year, used as the box label
  if (i == 1) {
    df = data.frame(dates, 
                    rep(label, length(dates)), 
                    rep(comm, length(dates)))
  } else {
    df = rbind(df, data.frame(dates, 
                              rep(label, length(dates)),
                              rep(comm, length(dates))))
  }
}
colnames(df) = c("YEAR","RANGE","COMM")
df$RANGE = as.factor(df$RANGE)
df$COMM = as.factor(df$COMM)
boxplot(df$YEAR ~ df$RANGE, 
        horizontal = TRUE, 
        # width = table(df$RANGE),
        width = table(df$COMM),
        las = 1,
        outline = FALSE)
abline(v = 1640)
abline(v = 1660)

Last thoughts

As mentioned above, the data and analysis in this chapter were drawn from the Phase I release of EEBO. The network is provided online and is, I believe, a fairly comprehensive and highly accurate representation of the early modern London book trade. The data is easy to measure and experiment with. That said, for the purposes of further research, scholars are likely to prefer working with networks drawn from all of EEBO, or perhaps the entire ESTC. The term bibliographic data science was recently coined by Leo Lahti, Jani Marjanen, Hege Roivainen, and Mikko Tolonen to describe this work.