In addition to the regular corpus interface, there is a wide range of
other corpus-based resources, some of which allow you to download large amounts of
data for offline use. (Compare to academic
Download full-text data for iWeb, COCA, COHA, GloWbE, NOW, Coronavirus, Wikipedia,
SOAP, the TV Corpus, and the Movies Corpus. With this data, you will have the texts
from the corpora on your own computer, rather than having to use the web
interface. The data comes in three formats: relational database, word/lemma/PoS
(vertical format), and text (linear format).
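To illustrate how the vertical and linear formats relate, here is a minimal Python sketch. It assumes each line of a vertical-format file holds one token as three tab-separated fields (word, lemma, PoS). That layout, the sample tags, and the names Token, parse_vertical and to_linear are assumptions made for illustration; the actual downloaded files may carry extra columns or a different tag set.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


def parse_vertical(lines: Iterable[str]) -> List[Token]:
    """Parse one-token-per-line text: word <TAB> lemma <TAB> PoS (assumed layout)."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed layout
        tokens.append(Token(parts[0], parts[1], parts[2]))
    return tokens


def to_linear(tokens: List[Token]) -> str:
    """Rebuild a linear (running text) version from the word column."""
    return " ".join(t.word for t in tokens)


# Illustrative sample; the tags here are placeholders, not the corpus tag set.
sample = ["She\tshe\tPRON", "decided\tdecide\tVERB", "quickly\tquickly\tADV"]
tokens = parse_vertical(sample)
linear = to_linear(tokens)
```

Keeping the lemma and PoS columns alongside each word is what makes the vertical format convenient for lemma-based counts, while the linear form is easier to read or to feed to other tools.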
Download lists of the top 60,000 lemmas in COCA, including their
frequency in each of the eight main genres and nearly 100 sub-genres. You
can also download a list with the frequency of each word form (e.g.
decide, decides, deciding, decided), as well as a list of the
top 219,000 words (not lemmas) in COCA, including frequency by
Download lists with the top 200-300
collocates (nearby words) for 60,000 different lemmas -- 13,500,000 node/collocate
pairs in all.
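As a rough picture of what a node/collocate pair is, the following Python sketch counts the words that occur within a fixed span of a node word. The span of four words on each side and the function name collocates are assumptions made for this example; the page does not say how the downloadable lists were computed.

```python
from collections import Counter
from typing import List


def collocates(tokens: List[str], node: str, window: int = 4) -> Counter:
    """Count words appearing within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # do not count the node itself
                counts[tokens[j]] += 1
    return counts


tokens = "strong tea and strong coffee but weak tea".split()
result = collocates(tokens, "tea", window=1)
```

With a window of 1, only the immediate neighbours of each occurrence of the node are counted; a wider window picks up looser associations.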
Download lists (in various formats) of all 2-, 3-, 4-, and 5-word
strings that occur at least four times in COCA -- more than 40
million n-grams in total.
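The idea behind such a list can be sketched in a few lines of Python: count every 2- to 5-word string in a token sequence and keep those that reach a minimum frequency of four. The function name ngram_counts and the toy input are hypothetical, and this is only an illustration of the frequency threshold, not the procedure used to build the COCA lists.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngram_counts(
    tokens: List[str],
    n_values: Sequence[int] = (2, 3, 4, 5),
    min_freq: int = 4,
) -> Dict[Tuple[str, ...], int]:
    """Count n-grams of the given lengths and keep those seen at least min_freq times."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy input: "in the end" repeated four times, followed by two more words.
tokens = ("in the end " * 4).split() + ["it", "worked"]
frequent = ngram_counts(tokens)
```

On this toy input only "in the", "the end" and "in the end" reach the threshold; strings that span two repetitions, such as "end in", occur just three times and are dropped.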
Download free lists from the 120
million words of COCA-Academic texts, including
academic words grouped by word families, lists of "core"
academic English, and "technical" word lists for the nine
domains of COCA-Academic (e.g. Law, Medicine, or Business).
All of the resources listed above are for COCA and other "smaller"
corpora (e.g. 100 million to two billion words in size). You can also
access data from the 14 billion word iWeb corpus, which has its own