To see an example of this, search for into revealing and then click on the 1950s bar to see the examples from the 1950s at Google Books. The second and fourth entries contain the phrases maneuvered them into revealing and trap the husband into revealing. But when you search for either of these phrases by itself (click on them to see), there are no entries. That's because neither of these four- or five-word strings occurs at least 40 times, and they are therefore "invisible".
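The effect of this kind of frequency cutoff can be sketched in a few lines of code. This is only an illustrative toy, not the actual Google Books pipeline: a simple n-gram counter that drops any string occurring fewer than `threshold` times, so low-frequency strings become "invisible" in the same way.

```python
# Toy sketch (not the real Google Books pipeline): count n-grams and
# discard any that fall below a frequency threshold.
from collections import Counter

def visible_ngrams(tokens, n, threshold):
    """Return only the n-grams that occur at least `threshold` times."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return {g: c for g, c in counts.items() if c >= threshold}

# Tiny invented corpus: the bigram "a b" occurs 3 times; every other
# bigram occurs only once and so is filtered out ("invisible").
corpus = "a b c a b a b".split()
print(visible_ngrams(corpus, 2, threshold=2))  # → {'a b': 3}
```

With a real threshold of 40, a phrase like trap the husband into revealing that occurs, say, a dozen times in the corpus would simply never appear in the searchable dataset.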
The "40-token threshold" matters because, although Google Books is much larger than a corpus like the 400-million-word Corpus of Historical American English (COHA) and will almost always have more tokens (total occurrences), it may have about the same number of unique strings (types) as COHA, or even fewer. Consider the following examples:
This becomes much more of a problem for Google Books with longer strings -- 4-grams and 5-grams -- where there are far more possible strings, and less chance that any given string will occur the required 40 times. Consider the following table:
In #1, only the [j*] slot has a fairly wide range of possibilities, and so COHA has only a few more types (unique strings) than the American English dataset from Google Books. But in #2, the first and (especially) fourth slots have lots of possibilities, and that's why COHA has about six times as many types, even though it's a much smaller corpus.
Overall, then, Google Books nearly always has many more tokens, and since each type occurs at least 40 times, you can be quite sure that the types are not typos or other anomalies. On the other hand, the 40-token threshold means that the results sometimes suffer in terms of the number of types.
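The token/type trade-off described above can be simulated with two toy corpora. All the numbers here are invented for illustration: a "large" corpus with ten times as many tokens but a frequency cutoff applied, and a "small" corpus with no cutoff. Despite its size advantage in tokens, the thresholded corpus can end up with far fewer searchable types.

```python
# Toy simulation of the token/type trade-off. The vocabulary size,
# corpus sizes, and threshold are invented for illustration only.
import random
from collections import Counter

random.seed(0)
VOCAB = 5000  # number of distinct possible "strings"

# Large corpus: 100,000 tokens, but only types occurring >= 40 times count.
large = Counter(f"w{random.randrange(VOCAB)}" for _ in range(100_000))
large_types = sum(1 for c in large.values() if c >= 40)

# Small corpus: 10,000 tokens, every observed type counts.
small = Counter(f"w{random.randrange(VOCAB)}" for _ in range(10_000))
small_types = len(small)

print("large corpus:", sum(large.values()), "tokens,", large_types, "types")
print("small corpus:", sum(small.values()), "tokens,", small_types, "types")
```

With these settings the large corpus wins decisively on tokens but the small one wins on types, which is exactly the pattern the table above shows for COHA versus the thresholded Google Books datasets.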