
Conceptual Topography

The code below offers a walkthrough of Chapter 3: Conceptual Topography, showing how to replicate all data visualizations in that chapter. If you have not yet done so, be sure to install the litmath and litmathdata R packages, which include the necessary functions and data. Both packages can be installed from GitHub by opening RStudio and entering the following commands (which require the devtools package) in the console.

devtools::install_github("michaelgavin/litmath@main")
devtools::install_github("michaelgavin/litmathdata@main")

Activate libraries and import data

Starting from a fresh R session with a clear working environment, the first step is always to activate the libraries and import the data.

library(litmath)
library(litmathdata)

data(geo)
data(place_year)
data(footprints)

The data imported above includes geographical metadata, geo, an R 'list' that contains the gazetteer, geocoordinates, and the list of all toponyms. The footprints dataset stores the output of the geospatial calculations for each keyword, and the place_year matrix records the frequency of each toponym in the EEBO corpus over time.
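
To get oriented, it can help to glance at the shape of these objects first. The commands below are just a quick sketch using base R; the footprints column names mentioned in the comments (FREQ, LON, LAT, RADIUS, STDEV, and so on) follow the usage later in this walkthrough.

names(geo)             # includes the gazetteer, coordinates, and keywords
dim(place_year)        # toponyms by years
colnames(footprints)   # per-keyword statistics such as FREQ, LON, LAT, RADIUS, STDEV
head(footprints)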

The first graph in the chapter, 3.1, "Geographical Spread of 2,000 Words in EEBO," visualizes how words are situated geographically in the corpus. That is, it is based on word-collocation data, where some of the words are toponyms associated with geocoordinates. "Geographical spread" identifies the centroid of the collocated places' latitudes and longitudes, along with the radius: the average distance of each collocated place from that centroid.
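
As a minimal sketch of the idea, with made-up collocation counts for three places rather than data from the corpus, the centroid is the frequency-weighted mean of the collocated places' coordinates, and the radius is the frequency-weighted mean of each place's great-circle distance from that centroid. The great_circle_distance() function from litmath is called here with the same arguments used in the "Measuring Semantic Footprints" section below.

library(litmath)

# Hypothetical collocation counts for one keyword across three places (not corpus data)
counts = c(london = 120, paris = 40, rome = 25)
lons = c(-0.13, 2.35, 12.50)
lats = c(51.51, 48.86, 41.90)

# Frequency-weighted centroid
mean_lon = sum(lons * counts) / sum(counts)
mean_lat = sum(lats * counts) / sum(counts)

# Frequency-weighted mean distance from the centroid: the "radius"
d = sapply(seq_along(counts), function(i) {
  great_circle_distance(lon1 = mean_lon, lat1 = mean_lat, lon2 = lons[i], lat2 = lats[i])
})
radius = sum(d * counts) / sum(counts)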

Figure 3.1 is a scatterplot of the 2,000 most frequent terms (among the 7,507 keywords identified from the corpus), with the x-axis representing the mean longitude of each word's collocated places and the y-axis its radius.

# Select the 2,000 most frequent keywords
hits = order(footprints[,"FREQ"], decreasing = T)[1:2000]
x = footprints[hits,"LON"]
y = footprints[hits,"RADIUS"]
plot(x, y, cex = 0, main = "x = mean longitude, y = radius")
text(x, y, labels = rownames(footprints)[hits])

As with all data visualizations that plot words along x- and y-axes, you'll notice differences between the graphs produced in R and what you see in the book. For legibility, I selected seemingly interesting keywords to highlight, tweaked the locations of words that overlapped too much to read, and sometimes deleted words when too many sat in the same location for visual reading to be possible.

Actually compiling the semantic footprint data is a somewhat laborious process, so readers hoping to replicate figure 3.1 from something like ‘scratch’ should see the section below, "Measuring Semantic Footprints." Other readers can continue just using the provided data.

The gazetteer

Table 3.1 displays a subset of the gazetteer data.

gaz = geo$gazetteer
hits = which(gaz$SUBJECT %in% c("gotembourg", "gotha", "gothebourg", "gothen"))
gaz[hits,]
  NUM  SUBJECT     PREDICATE    OBJECT      SOURCE
11691  gotembourg  is same as   gothebourg  A28561
11692  gotha       instance of  city        A37751
11693  gotha       is in        saxony      A37751
11694  gotha       is in        thuringia   A37751
11695  gotha       is same as   gothen      A28561
11700  gothebourg  instance of  city        A28561
11701  gothebourg  instance of  city        A37751
11702  gothebourg  is in        america     A28561
11703  gothebourg  is in        copenhagen  A37751
11704  gothebourg  is in        sweden      A28561
11705  gothen      instance of  city        A28561
11706  gothen      is in        franconia   A28561
11707  gothen      is in        misnia      A28561
11708  gothen      is in        saxony      A28561
11709  gothen      is in        thurigina   A28561

Notice that the gazetteer data does not include geocoordinates. It’s based on a subject-predicate-object "linked data" structure that identifies relationships between terms as testified in various EEBO-TCP documents. The geocoordinates (also taken from EEBO-TCP files) are stored in a separate dataset.
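
For example, to check whether a particular toponym has recorded coordinates, you can look it up in geo$coordinates. This is just a sketch; the NAME, LAT, and LON columns follow the usage in the "Measuring Semantic Footprints" section below.

lonlat = geo$coordinates
head(lonlat[, c("NAME", "LAT", "LON")])

# Look up a single toponym; this returns zero rows if no coordinates were recorded
lonlat[lonlat$NAME == "gothebourg", ]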

To visualize the distribution of values for latitude and longitude, figure 3.3 displays them in a simple scatterplot.

lonlat = geo$coordinates
x = lonlat$LON
y = lonlat$LAT
# Recenter longitudes greater than 180 degrees onto the -180 to 180 range
x[x >= 180] = x[x >= 180] - 360
plot(x,y, pch = 20)

Most of the geocoordinates were taken from the following sources: Laurence Echard’s The gazetteer’s, or, Newsman’s interpreter (1692 [A37751]), his A most compleat compendium of geography (1691 [A37760]), and Peter Heylyn’s Cosmographie in four books (1652 [A43514]). Because Echard’s Newsman’s interpreter is limited to places in Europe, it’s very possible that figure 3.3 exaggerates somewhat the Eurocentrism of EEBO’s geographical scope, and that many more places outside Europe were located with coordinates of latitude and longitude than I was able to find. It is also possible that gazetteers constructed from early modern maps, rather than from the EEBO-TCP files, would reveal different patterns of geographical interest.

However, I should also say here that such additional data-curatorial work is highly unlikely to change the substance of the analyses in this chapter. As I discuss in the chapter, references to various places occur in EEBO in the shape of a power-law distribution. The hundred most frequently mentioned places represent the majority of all toponymic references in the corpus. Most geographical locations are only very rarely referred to at all, and are therefore not major drivers of EEBO’s semantic structure. Thus, while a larger and more detailed gazetteer would be valuable for measuring various local histories across the early modern world, the larger picture represented in figure 3.1 is not likely to be substantially altered by such detail.

Place frequency analysis

Tables 3.2 to 3.7, and figures 3.4 to 3.7 all take advantage of the gazetteer’s linked data structure by either grouping toponyms that are variants of the same place name (like "France" and "Fraunce," or like "England" and "Albion") or by gathering toponyms that are spatially related by containment (like "Dublin" is in "Ireland"). This allows us to track how often places are mentioned explicitly, or how often a country or region is referred to in general. Sometimes we want to know when the term "Asia" was used, but other times we might want to count all references to "Persia," "India," and "China" as references to Asian places.
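
As a minimal illustration before the full pipeline below, the place_join() function (used later with mode = "same") gathers the spelling variants and alternate names recorded for a toponym, while containment can be queried directly from the gazetteer's "is in" triples.

gaz = geo$gazetteer

# Spelling variants and alternate names recorded for "france"
place_join(places = "france", gaz = gaz, mode = "same")

# Toponyms the gazetteer records as being "in" ireland
gaz$SUBJECT[which(gaz$PREDICATE == "is in" & gaz$OBJECT == "ireland")]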

The code below looks fairly complicated, but it all just involves shuffling the toponyms in various ways. References to places across EEBO are stored in a place-document matrix, that is, a term-document matrix where the only terms counted are toponyms. Depending on the needs of each table and graph, the gazetteer is consulted, the relevant rows are composed as needed (the row of word counts for "fraunce," for example, is added to the row for "france"), and the totals of each resulting row are taken.

# The analyses that follow assume that spelling variations and alternate names
# have been gathered, as done in the code below
data(place_doc)
data(place_year)
toponyms = rownames(place_doc)

# The toponym "world" has very strange properties, and perhaps is not a toponym at all,
# so I exclude it from analysis.
toponyms = toponyms[toponyms != "world"]

# Exclude duplicate toponym variations, keeping the most frequent spelling in
# each group of "same as" variants. Raw toponym totals are computed here from
# place_year to decide which spelling is most frequent.
totals = rowSums(as.matrix(place_year))
duplicate_terms = c()
for (i in 1:length(toponyms)) {
  print(i)
  toponym = toponyms[i]
  sames = place_join(places = toponym, gaz = gaz, mode = "same")
  if (length(sames) == 1) {
    next
  }
  if (length(sames) > 1) {
    top_place = names(sort(totals[sames], decreasing = T))[1]
    if (top_place != toponym) {
      duplicate_terms = c(duplicate_terms, toponym)
    }
  }
}

# Define list of unique places
places = setdiff(toponyms, duplicate_terms)

# Remove excluded terms and trim 15th century years
py = as.matrix(place_year)
py = py[toponyms, 27:226]

# Now reconstitute place_year by resolving frequency counts for
# alternate spellings
composed_py = matrix(0, length(places), 200)
rownames(composed_py) = places
colnames(composed_py) = colnames(py)
for (i in 1:length(places)) {
  print(i)
  place = places[[i]]
  sames = place_join(place, gaz, mode = "same")
  vec = place_composition(py, sames)
  composed_py[i,] = vec
}

#### Table 3.2 ####
totals = rowSums(composed_py)
sort(totals, decreasing = T)[1:10]
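
# As discussed in the chapter, toponym frequencies follow a power-law-like distribution.
# As a quick check (a sketch, not a table from the book), the share of all references
# accounted for by the hundred most frequent places:
sum(sort(totals, decreasing = T)[1:100]) / sum(totals)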

#### Tables 3.3 to 3.6 ####
places = c("europe","asia","africa","america",
           "china","india","greece","persia","france","germany",
           "italy","england","scotland","ireland","wales")
for (p in 1:length(places)) {
  place = places[p]
  hits = gaz$SUBJECT[which(gaz$PREDICATE == "is in" & gaz$OBJECT == place)]
  hits = intersect(hits, rownames(composed_py))
  totals = rowSums(composed_py)
  x = sort(totals[hits], decreasing = T)[1:20]
  if (p == 1) {
    places_df = data.frame(rep(place, length(x)), names(x), x)
    colnames(places_df) = c("place","places","freq")
  } else {
    df = data.frame(rep(place, length(x)), names(x), x)
    colnames(df) = c("place","places","freq")
    places_df = rbind(places_df, df)
  }
}
# The 'places_df' data frame, defined in the loop above, holds all the data for
# tables 3.3 to 3.6, though it is formatted somewhat differently than in the book.
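
# To inspect the rows behind any one of those tables, subset by the 'place' column,
# for example:
places_df[places_df$place == "europe", ]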

#### Figure 3.4 ####
freq_dist = table(totals)
y = as.numeric(freq_dist)
x = as.numeric(names(freq_dist))
plot(x, y, log = "x")
abline(v = median(totals), col = "red")

#### Table 3.7 ####
ranges = list(1:100,101:140,141:160,161:180,181:200)
for (i in 1:length(ranges)) {
  subtotals = rowSums(composed_py[,ranges[[i]]])
  subtotals = sort(subtotals, decreasing = T)[1:5]
  subtotals = round(subtotals / 1000, digits = 1)
  if (i == 1) {
    top_five = matrix(names(subtotals),5,1)
    top_five = cbind(top_five, subtotals)
  } else {
    top_five = cbind(top_five,names(subtotals))
    top_five = cbind(top_five, subtotals)
  }
}


#### Figure 3.5 ####
plot_timeseries(composed_py, places = c("england", "rome"), compose_places = F)

#### Figure 3.6 ####
# Count, for each year, how many distinct places are mentioned at least once
vec = apply(composed_py, 2, function(x) { length(x[x>0])})
plot(vec)
abline(lm(vec ~ c(1:200)), col = "red")

#### Figure 3.7 ####
continents = c("europe","asia","africa","america")
plot_timeseries(composed_py, places = continents)

Conceptual topography

Figure 3.10 has three layers. The first is a closeup of all points with geographical coordinates, focused on Europe.

all_lat = geo$coordinates$LAT
all_lon = geo$coordinates$LON
# Recenter longitudes and zoom in on Europe and the surrounding region
all_lon[all_lon > 180] = all_lon[all_lon > 180] - 360
ord = which(all_lat > 20 & all_lon < 80 & all_lon > 0)
plot(all_lon[ord], all_lat[ord], pch = 20, col = "gray")

The second layer plots the centroid of each term's footprint.

lon = footprints[,"LON"]
lat = footprints[,"LAT"]
points(lon, lat, pch = 20, col = "black")

The third is the center point and line of best fit.

points(mean(lon), mean(lat), pch = 20, col = "red")
lines(footprints[,"LGD_LON"], footprints[,"LGD_LAT"], lwd=1, col="red")

National, regional, and global analysis

The visualizations at the center of the final section of the chapter are all closeups of a single base graph, very similar to figure 3.1, differing mainly in that the x-axis measures each word's position along the line of best fit rather than its simple mean longitude.

As with figure 3.1, please keep in mind that the R graphics generated will differ slightly from those in the book. Because of the difficulty visualizing large numbers of words, the graphs that appear in the book were tweaked by hand for legibility, and the closeups used slightly different parameters here and there. For example, the region surrounding "europe" overlaps a great deal with that of "mediterranean," so figure 3.16 is actually slightly offset, potentially exaggerating Europe’s association with Asia. (My goal there was to ensure the region between Europe and Asia was fully represented in my analysis.)

The base graphs can be generated using the code below.

# Make atlas. Begin by filtering out short and low frequency words
words = geo$keywords
words = intersect(words, 
                  rownames(footprints[order(footprints[,"FREQ"], decreasing = T)[1:5000],]))
words = words[nchar(words) > 3] # Removes words with fewer than 4 letters
words = c(words,"god","man","men","old","new","sun",
          "mediterraneansea") # But adds these back in

x = footprints[words,"LGD_LON"]
y = footprints[words,"STDEV"]
x = x[words]
y = y[words]
fit = lm(y ~ x)
words = intersect(words, names(fit$residuals[fit$residuals < 0]))

x = footprints[words,"LGD_LON"]
y = footprints[words,"RADIUS"]
fit = lm(y ~ x)

# Converting y-axis to residuals turns 
y = fit$residuals

# Plot overview map with line of best fit
plot(x, y, pch = 20, main = "overview map")
abline(h = 0, lwd = 2, col = "red")

# Now, for the purpose of distance measurements on a logarithmic scale, shift y
# so that all values are non-negative, then bind the adjusted x and y values
y = y + (-1 * min(y))
xy = cbind(log(x),log(y))

# For any given word, find the nearby words
place = "asia"

# Find the distance that separates that word from all others
d = apply(xy, 1, euclidean_distance, vec2 = xy[place,])

# Select the closest words
hits = order(d)[1:175]

# Build the plot
plot(x[hits], y[hits], cex = 0, main = place)
text(x[hits],y[hits],labels=words[hits])
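
To generate closeups for several regions without rerunning the last few steps by hand, you could wrap them in a small helper. The closeup() function below is a hypothetical convenience, not part of litmath; it assumes the x, y, xy, words, and euclidean_distance objects used above.

# Hypothetical helper (not part of litmath): plot the n words nearest to a given place
closeup = function(place, n = 175) {
  d = apply(xy, 1, euclidean_distance, vec2 = xy[place,])
  hits = order(d)[1:n]
  plot(x[hits], y[hits], cex = 0, main = place)
  text(x[hits], y[hits], labels = words[hits])
}

closeup("asia")  # reproduces the plot above
# Other regions from the chapter can be substituted, provided they survive the
# filtering steps above.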

Measuring Semantic Footprints

Above, figure 3.1 was produced using the footprints dataset. To recreate that dataset on your own, run the code below. Please note that even this is not quite 'raw' data (if by 'raw' you mean not only uncooked but completely unprocessed) because it relies on term-document matrices that have already been assembled from EEBO documents.

# Note the "import_data()" function, which is needed to compile this
# larger keyword by document matrix, unlike most data which can be 
# loaded simply with "data()"
keyword_doc = import_data("keyword_doc")
data(place_doc)

# Use matrix multiplication to get place-keyword collocation
pk = place_doc %*% t(keyword_doc)

# Load geocoordinates for places
data(geo)
lonlat = geo$coordinates
# Center at 180
lonlat$LON[lonlat$LON > 180] = lonlat$LON[lonlat$LON > 180] - 360


# Define semantic footprint function
semantic_footprint = function (mat, term, coords) {
  # Collocation counts between the term and every place, keeping only places
  # with nonzero counts and known coordinates
  vec = mat[, term]
  vec = vec[vec > 0]
  vec = vec[names(vec) %in% coords[, "NAME"]]
  hits = which(coords[, "NAME"] %in% names(vec))
  vec = vec[coords[hits, "NAME"]]
  lat = coords[hits, "LAT"]
  lon = coords[hits, "LON"]
  lon[lon >= 180] = lon[lon >= 180] - 360
  # Frequency-weighted centroid of the collocated places
  mean_lon = sum(lon * vec[vec > 0])/sum(vec)
  mean_lat = sum(lat * vec[vec > 0])/sum(vec)
  lon = coords[hits, "LON"]
  # Great-circle distance of each place from the centroid
  distances = c()
  for (i in 1:length(lon)) {
    lati = lat[i]
    loni = lon[i]
    d = great_circle_distance(lon1 = mean_lon, lat1 = mean_lat, 
                              lon2 = loni, lat2 = lati)
    distances = c(distances, d)
  }
  # Radius: frequency-weighted mean distance; stdev: standard deviation of distances
  mean_dist = sum(distances * vec)/sum(vec)
  sd_dist = sd(distances)
  results = list()
  results$n = length(vec)
  results$freq = sum(vec)
  results$sd = sd(vec)
  results$lon = mean_lon
  results$lat = mean_lat
  results$radius = mean_dist
  results$stdev = sd_dist
  return(results)
}


# Build semantic footprints: one column per keyword, one row for each of the
# seven statistics returned by semantic_footprint()
footprints = matrix(0, 7, ncol(pk))
colnames(footprints) = colnames(pk)
for (j in 1:ncol(pk)) {
  print(j)
  res = semantic_footprint(term = colnames(footprints)[j],
                           mat = pk,
                           coords = lonlat)
  res = unlist(res)
  footprints[,j] = res
}
footprints = t(footprints)
colnames(footprints) = names(res)
colnames(footprints) = toupper(colnames(footprints))

# Identify the line of geographical difference (line of best fit)
lon = footprints[,"LON"]
lat = footprints[,"LAT"]
fit = lm(lat ~ lon)
estimated_x = seq(min(lon),max(lon), length.out = length(lon))
estimated_y = fit$coefficients[2] * estimated_x + fit$coefficients[1]
estimated_mat = matrix(c(estimated_x, estimated_y), 7507, 2)

euclidean_distance = function(vec1, vec2) {
  return(sqrt(sum((vec1 - vec2)^2)))
}

# Get position for each term along the line of best fit
lgd = matrix(0, 7507, 2)
rownames(lgd) = rownames(footprints)
for (i in 1:nrow(footprints)) {
  print(i)
  lati = lat[i]
  loni = lon[i]
  veci = c(loni, lati)
  results = apply(estimated_mat, 1, euclidean_distance, veci)
  hit = which.min(results)
  lgd[i,] = estimated_mat[hit,]
}
colnames(lgd) = c("LGD_LON", "LGD_LAT")
footprints = cbind(footprints, lgd)
colnames(footprints) = toupper(colnames(footprints))