Take a look at the picture to the left. If the large white square is the American English dataset of Google Books (155 billion words), then the small blue square is the Corpus of Historical American English (COHA; 400 million words), at roughly 1/400th the size. And the tiny white dot inside the blue square represents a 1,000,000 word corpus like the Brown corpus or the American part of ARCHER, which is in turn 1/400th the size of COHA. In other words, a 1,000,000 word corpus is about 1/160,000th the size of the Google Books corpus.
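To make that arithmetic concrete, here is a quick back-of-the-envelope check of the ratios (a minimal Python sketch; the figures are simply the approximate corpus sizes quoted above):

    # Approximate corpus sizes from the comparison above
    google_books = 155_000_000_000  # American English dataset of Google Books
    coha = 400_000_000              # Corpus of Historical American English
    small_corpus = 1_000_000        # Brown corpus / American part of ARCHER

    print(google_books / coha)          # ~387.5 -- COHA is roughly 1/400th of Google Books
    print(coha / small_corpus)          # 400.0  -- Brown/ARCHER is 1/400th of COHA
    print(google_books / small_corpus)  # 155,000 -- on the order of 1/160,000th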

A small 1,000,000 word corpus like Brown or ARCHER can be carefully, manually crafted to have just the right type of texts for each historical period. COHA is likewise balanced by genre, decade by decade. That may give those corpora certain advantages over Google Books. (Nevertheless, there is an interesting question of how "representative" of the "whole" a corpus needs to be when its size and scope ("scan all the books in this huge university library") essentially make it "the whole" already.)

However, there is little doubt that the larger a corpus is, the less artificially restricted we are in the phenomena we can look at. With small corpora like Brown or ARCHER, even the best and brightest researchers in historical English are limited to high-frequency syntactic constructions like modals, auxiliary verbs, and relative pronouns, which have been the "bread and butter" of countless studies over the past 15-20 years. With COHA, our horizons open up dramatically: we can look at lexical change, morphological change, semantic change, and many types of syntactic change (such as verbal subcategorization) that would be utterly impossible with the small, one million word corpora.

With an incredibly large corpus like Google Books, it's not so much that there are lots of new kinds of phenomena to look at. Unlike the comparison between COHA and Brown or ARCHER -- which is dramatic -- most of the phenomena that can be studied with Google Books can also be studied quite well with COHA. And COHA often yields nearly as many types (unique forms) as Google Books, due in large part to the 40-token threshold for n-grams in Google Books. It's just that -- at roughly 400 times the size of COHA -- Google Books lets us "flesh out" the data for lower-frequency constructions.

To take one quick example (which could be multiplied many times over), there are 376 types (unique forms) in COHA for the construction "walked ADV-ly" (walked quickly, walked defiantly, etc.). However, almost half of these types occur just once. Do these adverbs really belong with walked? Are they typos? With just one token, it's hard to know. In the American English dataset of Google Books, there are 398 types, just a bit more than in COHA. But because of the 40-token threshold for n-grams in Google Books, each of these occurs at least 40 times, and we don't have to worry nearly so much about typos or about whether the two words occurred together just once "by chance". That's the value of having a huge corpus like Google Books.
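As a rough illustration of how such a threshold separates real collocates from noise, here is a small sketch (the bigram counts and the helper function are hypothetical, not the actual COHA or Google Books data or tooling) that counts "walked ADV-ly" types with and without a 40-token cutoff:

    from collections import Counter

    # Hypothetical token counts for "walked + -ly adverb" bigrams; in a real
    # study these would come from COHA concordance output or the Google Books
    # n-gram files, not from a hard-coded dictionary.
    bigram_counts = Counter({
        "walked quickly": 1520,
        "walked slowly": 2210,
        "walked briskly": 480,
        "walked defiantly": 96,
        "walked happlily": 1,  # likely a typo for "happily"
        "walked greenly": 1,   # one-off pairing, possibly chance
    })

    def types_above_threshold(counts, min_tokens=1):
        """Return the unique forms (types) occurring at least min_tokens times."""
        return {form for form, n in counts.items() if n >= min_tokens}

    print(len(types_above_threshold(bigram_counts)))                 # 6 types, noise included
    print(len(types_above_threshold(bigram_counts, min_tokens=40)))  # 4 types survive the cutoff

In a 155-billion-word corpus, a genuine collocate will almost always clear a 40-token cutoff, while typos and chance co-occurrences rarely will; in a 400-million-word corpus, many genuine but rare collocates never reach such a cutoff in the first place.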