In addition to the regular corpus interface, there is a wide range of
other corpus-based resources, some of which allow you to download large amounts of
data for offline use. (Compare to the academic license.)
Full-text |
Download full-text data for iWeb, COCA, COHA, GloWbE, NOW, Coronavirus, Wikipedia,
SOAP, the TV Corpus, and the Movies Corpus. With this data, you will have the corpus
texts on your own computer, rather than having to use the web interface. The
data comes in three formats: relational database, word/lemma/PoS
(vertical format), or text (linear format). |
Word Frequency |
Download lists of the top 60,000 lemmas in COCA, including the
frequency in each of the eight main genres and nearly 100 sub-genres. You
can also download a list with the frequency of each lemma's word forms (e.g.
decide, decides, deciding, decided), as well as a list of the
top 219,000 words (not lemmas) in COCA, including frequency by
genre. |
Collocates |
Download lists with the top 200-300
collocates (nearby words) for 60,000 different lemmas -- 13,500,000 node/collocate
pairs in all. |
N-grams |
Download lists (in various formats) of all 2-, 3-, 4-, and 5-word
strings that occur at least four times in COCA -- more than 40
million n-grams in total. |
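To give a rough idea of how the full-text data and the n-gram lists relate, the following is a minimal sketch rather than a description of the actual download files. It assumes a vertical-format file has one token per line with tab-separated word, lemma, and PoS columns; the real column order and delimiter may differ. The names `Token`, `parse_vertical`, and `count_ngrams` are hypothetical, and the sample text and tags are made up for illustration. The sketch reads the tokens and keeps only the n-grams that occur at least four times, mirroring the threshold mentioned for the n-gram lists.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Token(NamedTuple):
    """One row of an assumed word/lemma/PoS vertical-format file."""
    word: str
    lemma: str
    pos: str


def parse_vertical(lines: Iterable[str]) -> list[Token]:
    """Parse one-token-per-line text (assumed tab-separated: word, lemma, PoS)."""
    tokens = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        tokens.append(Token(*parts))
    return tokens


def count_ngrams(words: list[str], n: int, min_count: int = 4) -> dict[tuple[str, ...], int]:
    """Count n-word strings and keep those seen at least min_count times."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Made-up sample: the same three tokens repeated four times.
sample = "the\tthe\tat\ndogs\tdog\tnn2\nbark\tbark\tvv0\n" * 4
tokens = parse_vertical(sample.splitlines())
words = [t.word.lower() for t in tokens]
frequent = count_ngrams(words, n=2)
```

In this sample, "the dogs" and "dogs bark" each occur four times and are kept, while "bark the" occurs only three times and is dropped. For real corpus files, reading line by line from disk instead of building the whole token list in memory would likely be necessary.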