[Davies/BYU] Corpus of Contemporary American English

This entire corpus is based on the n-grams that are provided by Google Books (detailed information; copyright). When you search the "corpus" at our site, you are actually searching these n-grams, not the full text of Google Books (sentences, paragraphs, and pages). However, the frequency lists that you will see here contain links to Google Books, where you can see the actual occurrences in the texts. There are five important things to note about these n-grams:

1. Creating the corpus: Overview of the original files and how they were processed for this corpus.
2. Threshold of 40 tokens: Only those n-grams that occur at least 40 times in total are included in the Google Books n-grams. More information...
3. Part of speech tagging: The n-grams are not tagged for part of speech, although we have created work-arounds for this. More information...
4. Case sensitivity: The Google Books n-grams are case sensitive, so "however there are" and "However there are" will have two separate entries. In the next few months, we'll add an option to allow you to conflate these two entries into one in the display.
5. Punctuation and apostrophes: The Google Books n-grams contain punctuation and apostrophes, such as:

1   ". I have never"        [CHART]
2   "idea ; nevertheless"   [CHART] [BOOKS]
3   "talk about ,"          [CHART] [BOOKS]

But as you can see from the links to the actual Google Books and the charts, these don't correspond well with each other (and some, like #1, don't have results in the books themselves). Since our n-gram data are the same as those used for the charts, they may be problematic as well for strings that contain punctuation. In other words, such n-grams will appear in our results, but the links won't actually find anything in the books at Google Books. This is a problem with Google Books and their data, not with our interface.
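
If you work with n-gram lists yourself, the case-sensitivity and punctuation issues above can be handled roughly as in the following sketch. It is only an illustration, not how our interface is implemented: the sample frequencies are made up, and the function names (conflate_case, has_punctuation_token) are hypothetical. It assumes each n-gram is a string of space-separated tokens, as in the examples above.

```python
from collections import defaultdict
import string

# Hypothetical sample entries: (n-gram, total frequency).
# The frequencies are invented for illustration only.
SAMPLE = [
    ("however there are", 120),
    ("However there are", 310),
    ("idea ; nevertheless", 45),
    ("talk about ,", 80),
    (". I have never", 60),
]


def conflate_case(entries):
    """Merge entries that differ only in case, summing their frequencies."""
    merged = defaultdict(int)
    for ngram, freq in entries:
        merged[ngram.lower()] += freq
    return dict(merged)


def has_punctuation_token(ngram):
    """Return True if any space-separated token is made up only of punctuation."""
    return any(
        all(ch in string.punctuation for ch in token)
        for token in ngram.split()
    )


merged = conflate_case(SAMPLE)
flagged = [ngram for ngram, _ in SAMPLE if has_punctuation_token(ngram)]
```

Here merged combines "however there are" and "However there are" into a single lowercase entry, and flagged lists the n-grams whose links to Google Books may not find anything. Note that a word with an internal apostrophe (such as "don't") is not flagged by this check, since only tokens that consist entirely of punctuation are treated as problematic.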