Comparing the Google Books datasets
(especially British and American English)

Google Books has datasets for a number of different languages and dialects. In the Advanced / English-Corpora.org interface to the Google Books data, we have incorporated the following datasets of English: American English, British English, Fiction, and the One Million Books collection. With just one click, users can move between the different datasets and compare the results from each.

With the American and British datasets, it should in theory be possible to compare these two dialects over time. Let's discuss whether this is in fact possible with the Google Books data.

DID GOOGLE SEPARATE THE AMERICAN AND BRITISH BOOKS CORRECTLY?

The first problem in comparing dialects, though, is knowing whether -- as the books were scanned in -- they were correctly categorized as British or American. Ideally, it would be nice to see the frequency data from the n-grams for a particular dataset, and then click to see the excerpts from that dataset (e.g. American or British), to see whether the books are from the right dialect. But this isn't possible.

Let's take a simple example. Search for basically in American and British English -- in either our interface (Am, Br) or the standard Google Books interface (Am, Br). (Note: to come back directly to this page, click on RETURN in upper right-hand corner.) In our interface, it shows that there are just 13 tokens from the 1870s in American English, but when you click on the bar for the 1870s in the chart, it shows about 2,780 results. And then when you click on the bar for the 1870s in the results from British English, instead of the expected 4 tokens, it again shows 2,780 results. What's wrong?

You might think at first that our data is wrong, but it's not. Our data comes from the same n-grams files that are used for the standard Google Books site. That's why our charts are exactly the same as theirs, decade by decade. (Since their charts don't actually show the number of tokens per decade, you just have to look at the "shape" of the curve over time).

But you can also download the n-grams yourself, and check our data against those. For example, download file 2 (of 10) of the data for 1-grams (single words) in American English from the Google n-grams site (note: it's big -- about 123 MB in size). On lines 17,767,875 - 17,768,007 you'll find the data for basically (copied here). Add up the totals per decade, plug them into a spreadsheet, and you'll get exactly the same chart as we have. You could also do the same for British English (Google Books file: lines 10,629,302 - 10,629,412). So, yes, our data is right.
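The "add up the totals per decade" step can also be done with a short script instead of a spreadsheet. The following is only a minimal sketch. It assumes that each row of the 1-gram file is tab-separated in the order word, year, match count, page count, volume count; that layout should be checked against the file you actually download. The function name tokens_per_decade and the sample rows are made up for illustration and are not real Google Books counts.

```python
from collections import defaultdict


def tokens_per_decade(lines, word):
    """Sum the match counts per decade for one word.

    Assumed row layout (tab-separated): word, year, match_count, ...
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed rows and rows for other words.
        if len(fields) < 3 or fields[0] != word:
            continue
        year = int(fields[1])
        match_count = int(fields[2])
        decade = year // 10 * 10  # e.g. 1871 -> 1870
        totals[decade] += match_count
    return dict(sorted(totals.items()))


# Invented sample rows, for illustration only.
sample = [
    "basic\t1871\t100\t90\t80",
    "basically\t1871\t3\t3\t3",
    "basically\t1875\t4\t4\t2",
    "basically\t1882\t2\t2\t2",
]

print(tokens_per_decade(sample, "basically"))
```

To run this on the real download, pass an open file handle instead of the sample list, and compare the per-decade totals with the numbers shown in the chart.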

But if it says that there are 13 tokens in the US and 4 tokens in the UK in the 1870s, why do the Google Books excerpts show many more tokens -- 2,780? That's not even close to the n-grams counts. It's because (apparently) there has been lots and lots of data that has been added to Google Books since the original n-grams were generated. Lots. So the n-grams frequency charts and the book excerpts are now wildly out of sync.

The second issue is perhaps even more frustrating in terms of comparing American and British English. Notice that we get the exact same total of 2,780 tokens, regardless of whether we clicked on the frequency chart of American or British English. Why is this? It's because the n-grams frequency charts do show the frequency for a particular dialect. But -- crucially -- in the books excerpts, all of the datasets get mixed together. In other words, it is completely impossible to see the book excerpts from just American or just British English.

If we can't see the excerpts from a particular dialect, then how do we know whether the books have been categorized correctly by dialect, as they were being scanned? Answer: WE CAN'T. We simply have to take it "on faith" that the Google Books people categorized them correctly.

USING THE TWO DATASETS TO LOOK AT DIALECTAL DIFFERENCES

In spite of the serious problems shown above -- and the fact that we basically have to take the Google Books n-grams on faith -- is there still any evidence, based on the Google Books data, for differences between British and American English? We believe that there is -- at least to some degree.

Download spreadsheet with data for these four phenomena (Note: to come back directly to this page after clicking on a link to do a query, click on RETURN in upper right-hand corner)
	Chart #1 to the left shows that gotten (vs got: he'd gotten more than enough) is much more common in American English (red line) than in British English. It is interesting that the two dialects were roughly similar until about 1900, when they diverged. The Google Books data also agrees with the COHA data (see spreadsheet), which shows the largest increase from the 1920s-1930s. The data also suggests that British English has been moving slightly towards the "American" gotten in the last 20 years, but this apparent shift is much less likely to be real. In the British National Corpus, gotten is still at only about 1.5% of all tokens (got, gotten), not the 17% shown in Google Books -- British for the 2000s (see spreadsheet for BNC data). This suggests that with this Google Books data, some American books (with gotten) were mistakenly categorized as British. (On the other hand, the BNC is now 20-30 years out of date, so the 5.9% gotten in Google Books -- British in the 1980s is closer to the 1.5% in the BNC from the 1980s-1993. Perhaps there has been an increase in gotten in British English since the early 1990s, but with no large, balanced corpus of contemporary British English, we'll never know.) Data: American: gotten, got (COHA: gotten, got) British: gotten, got
	British English strongly prefers HAVE + proved, whereas American English allows both proved and proven (see discussion). Chart #2 does show a greater preference for proven in American English. It also suggests that proven has increased since about the 1930s-1950s, which agrees with the COHA data (see spreadsheet). As with the data for got/gotten (#1 above), we are again faced with a problem for British English for the 1980s-2000s. Google Books shows that proven is increasing, but 25% proven for the 2000s seems unlikely. In the BNC, it is about 5% (see spreadsheet), and Google Books shows about 8% for the same period -- the 1980s. But without an updated British corpus since the early 1990s, it's impossible to know exactly what's going on. Data: American: proven, proved (COHA: proven, proved) British: proven, proved
	American English allows different from and different than, while British English rarely allows than, but it does allow to (it is quite different to ours). (In all of these examples, we limited the search to [be] different to the (e.g. it is different to the one we have), to exclude spurious cases like everyone has something different to offer.) The Google Books data does show that to is much more common in British English (blue line). But the 3.7% to in Google Books -- British for the 1980s seems quite a bit lower than the 12.7% to in the BNC (see spreadsheet). One possible explanation for this is that Google Books has categorized some American books (without to) as British, thus providing too low a figure for British. Data: American: from, than, to British: from, than, to
	While American English used to treat "possessive have" (I have a car) as an auxiliary, it is now treated as a main verb and it takes "do support" in negation and questions (e.g. do you have a car?). In British English, however, possessive have can still be considered an auxiliary verb, and thus there is inversion and no do in questions (e.g. have you a car, cf. have you seen him?). Chart #4 shows this very nicely. It shows that American English began to diverge from British English in the early 1800s, and that there has been a constant and sustained increase in +do since then, which agrees nicely with the COHA data (see spreadsheet). In the BNC (see spreadsheet), +do is very much the norm in spoken, informal British English -- at 90% or more. The lower 40-50% +do in Google Books -- British may be due to the fact that there are more formal books in Google Books than in the BNC, or at least fewer books with second-person questions like "do you have a". Data: American: -do, +do (COHA: -do, +do) British: -do, +do
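The percentages quoted in the four items above appear to be the share of one variant out of all tokens of the competing variants in a given decade (e.g. gotten out of got + gotten). A minimal sketch of that arithmetic follows. The function name variant_share and the counts are invented for illustration; they are not taken from Google Books, COHA, or the BNC.

```python
def variant_share(variant_counts, other_counts):
    """Percent of tokens per decade that use the variant.

    Both arguments map a decade (e.g. 1900) to a raw token count.
    Decades with no tokens at all are left out of the result.
    """
    shares = {}
    for decade in sorted(set(variant_counts) | set(other_counts)):
        variant = variant_counts.get(decade, 0)
        other = other_counts.get(decade, 0)
        total = variant + other
        if total:
            shares[decade] = round(100 * variant / total, 1)
    return shares


# Invented counts, for illustration only.
gotten = {1900: 10, 1950: 30}
got = {1900: 90, 1950: 70}

print(variant_share(gotten, got))
```

Running the same function on the American and British counts for each decade gives two series that can be plotted side by side, as in the charts.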
In summary, some data (like that shown in the four phenomena above) suggest that Google Books has done a fairly good job of separating books into American and British. But there is still a lot of "messiness" in the data -- perhaps as much as 10-20%. While it might be fine to use Google Books as a "starting point" to look at differences in British and American English, it probably makes sense to use a more accurate corpus for fine-grained studies. For American English, we do have such a corpus in the 400 million word Corpus of Historical American English (COHA). Unfortunately, there is currently no such large, accurate corpus of historical British English.

Mark Davies
Professor, Corpus Linguistics
