
English-Corpora.org

In addition to the regular corpus interface, there is a wide range of other corpus-based resources, some of which allow you to download large amounts of data for offline use. (Compare to academic license)

Full-text	Download full-text data for iWeb, COCA, COHA, GloWbE, NOW, Coronavirus, Wikipedia, SOAP, the TV Corpus, and the Movies Corpus. With this data, you will have the texts from the corpora on your own computer, rather than having to use the web interface. The data comes in three formats: relational database, word/lemma/PoS (vertical format), or text (linear format).
Word Frequency	Download lists of the top 60,000 lemmas in COCA, including the frequency by the eight main genres and nearly 100 sub-genres. You can also download a list with the frequency of the word forms (e.g. decide, decides, deciding, decided), as well as a list of the top 219,000 words (not lemmas) in COCA, including frequency by genre.
Collocates	Download lists with the top 200-300 collocates (nearby words) for 60,000 different lemmas -- 13,500,000 node/collocate pairs in all.
N-grams	Download lists (in various formats) of all 2-, 3-, 4-, and 5-word strings that occur at least four times in COCA -- more than 40 million n-grams in total.
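
To make the n-gram criterion concrete, here is a minimal Python sketch of counting 2- to 5-word strings in a list of tokens and keeping only those that occur at least four times. It is an illustration only: the function names and the toy token list are hypothetical, and it does not reflect how the downloadable lists were actually produced or how their files are laid out.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def frequent_ngrams(tokens, sizes=(2, 3, 4, 5), min_count=4):
    """Return {n: {ngram: count}}, keeping n-grams seen at least min_count times.

    The defaults mirror the description above (2- to 5-word strings, at least
    four occurrences); the names themselves are hypothetical.
    """
    result = {}
    for n in sizes:
        counts = count_ngrams(tokens, n)
        result[n] = {gram: c for gram, c in counts.items() if c >= min_count}
    return result


# Toy token list, made up for illustration (not corpus data).
tokens = ("a b c " * 4).split()
freq = frequent_ngrams(tokens)

for n, grams in freq.items():
    print(n, grams)
```

In this toy input only "a b" and "b c" (and the 3-word string "a b c") reach the threshold; strings such as "c a" occur three times and are dropped, which is the same kind of cutoff the lists apply at corpus scale.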