The Amazing Language Abacus

Google has unleashed part of its massive data-mining project – the Google Books Ngram Viewer. According to The New York Times, it’s a “…mammoth database culled from nearly 5.2 million digitized books available to the public for free downloads and online searches, opening a new landscape of possibilities for research and education in the humanities.”

On its face, the Ngram viewer is a toy, an online diversion. The basic interface allows you to enter phrases and compare their appearance in Google’s extensive archive. The default settings allow you to search from 1920-2000, but it seems that searches can start at 1500 AD, 50 years or so after the arrival of the Gutenberg Bible and movable type.
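
For the curious, here is a rough sketch of the kind of comparison the viewer draws, assuming it plots each term's share of all the words printed in a given year. The toy corpus and the yearly_share function are invented for illustration; this is not Google's code, data, or API.

```python
from collections import Counter

# Toy corpus of (publication year, text) pairs, invented for illustration only.
TOY_CORPUS = [
    (1950, "religion and science and religion"),
    (1950, "the science of light"),
    (1990, "science and secularism"),
    (1990, "religion in public life"),
]


def yearly_share(corpus, word):
    """Return {year: occurrences of word / total words printed that year}."""
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    # Compare two terms year by year, as the viewer does with its line chart.
    for term in ("religion", "science"):
        print(term, yearly_share(TOY_CORPUS, term))
```

Dividing by the yearly total matters because far more books were printed in 1990 than in 1950; raw counts alone would make nearly every word look like it was on the rise.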

The Times notes that the word “woman” was nearly absent from the archive until the ascent of feminism in the early 1970s. A quick test drive of the viewer turned up some interesting results. The interface encourages our tendency to think in simple-minded binaries, but given that we think certain pairings are intimately coupled, the skewed appearance of certain terms is a bit jarring. For instance, “religion” remains steady throughout, though the concept has received less attention since the 80s. For that matter, so does “secularism” – but according to the Ngram viewer, “secularism” doesn’t really register as an object of human interest. Likewise with “atheism.” “Fundamentalism” is statistically irrelevant, too.

“Gay” lost a bit of steam at the end of the 1930s, but came back with spirit in the 1980s. “Homosexual” has been in decline for a while.

“Black”, “African American”, “Negro”, and some racist pejoratives create the arc that one would expect. “White” falls from the 40s to the 60s, before taking a brief uptick in the mid-1970s. “Black” does just about the same, achieving a bit of harmony with “White” around the mid-90s. Of course, denuded of context, one has no way of knowing if we’re talking about optics, ink, or race. The same applies to the previous example: “straight” trumps both “gay” and “homosexual,” but who knows if we’re comparing apples and oranges, carpentry with politics.

There are less sexy, more useful applications of the data, though. The Times:

They tracked how eccentric English verbs that did not add “ed” at the end for past tense (i.e., “learnt”) evolved to conform to the common pattern (“learned”). They figured that the English lexicon has grown by 70 percent to more than a million words in the last 50 years and they demonstrated how dictionaries could be updated much more rapidly by pinpointing newly popular words and obsolete words.

This is quite interesting, actually.

The “they” in the preceding quote are a set of researchers who have worked with the data, spearheading the study of what they’re calling “culturomics.” How did they use this data – nearly 4% of the contents of the history of publishing?

Working with a version of the data set that included Hebrew and started in 1800, the researchers measured the endurance of fame, finding that written references to celebrities faded twice as quickly in the mid-20th century as they did in the early 19th. “In the future everyone will be famous for 7.5 minutes,” they write.


