Frequency lists, n-grams, and customized data
While use of all of the corpora is free, some
users may want to download frequency lists, n-grams, or customized data from
these corpora. This data can then be used offline for
many different purposes, such as:
- Developing teaching and testing materials
- Creating frequency-based dictionaries and other lexicographical resources
- Natural language processing
There are a number of different types
of frequency data that are available, including the following:
Type | Explanation | Sample files (click on links) |
Word/lemma |
The top 20,000 or 55,000 words
(depending on the corpus), grouped by lemma (so go = go,
goes, went, etc.). You can also obtain the frequency of
each individual word form (goes, went,
etc.) of each lemma, as well as the frequency of
the lemma in each of the five major genres in the corpus.
(See the links to the right for examples of these two
specialized lists from COCA; similar lists can be created for
any of the corpora.) |
|
N-grams |
The frequency of all two-word
(2-gram), three-word (3-gram), or other n-gram strings. With
these lists, you can quickly and easily find the frequency of
combinations of words across the corpus, without having to use the
corpus interface. In addition, you can specify which words you
want n-grams for (e.g. the top 20,000 lemmas, the top 10,000 NOUN+NOUN
combinations, or all words in your customized 30,000 word list). |
|
Sample sentences |
You can get any number of
sample sentences (with year, genre, and source) for any
number of words in lists that you send us. For example, we
recently provided 3-6 sentences each for 100,000+ words and phrases
for an online dictionary. These sentences are selected by
using collocates (with frequency and mutual information
score) to find good samples for each word. |
(From COCA) |
Other data |
If there is other data that you could use (without
having access to the full text; see Note 2 below), please
let us know. Examples might be the frequency of each word or phrase
in a 30,000 word/phrase list, or the frequency of all
synonyms for the top 10,000 lemmas in the corpus. |
|
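As a rough illustration of what the word/lemma and n-gram lists above contain, the following sketch counts n-grams and rolls word forms up to their lemmas using only Python's standard library. The token list and the `lemma_of` mapping are toy assumptions made up for this example; they are not corpus data, and this is not necessarily how the actual lists are produced.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (2-grams, 3-grams, ...)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def lemma_counts(tokens, lemma_of):
    """Sum word-form frequencies under their lemma.

    Forms missing from the (hypothetical) lemma_of mapping count as their own lemma.
    """
    return Counter(lemma_of.get(tok, tok) for tok in tokens)


# Toy data, assumed purely for illustration.
tokens = "she goes home and he went home and they go home".split()
lemma_of = {"goes": "go", "went": "go"}

bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
lemmas = lemma_counts(tokens, lemma_of)

print(bigrams[("home", "and")])  # frequency of the 2-gram "home and"
print(lemmas["go"])              # go + goes + went combined
```

A downloaded list gives you these counts directly for the whole corpus, so no tokenizing or counting of this kind is needed on your side.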
The prices for these lists depend on
the corpus. Some lists (such as the 20,000 word lists from the
Corpus del Español and the Corpus do Português) are only about
$200 for academic use. The lists from COCA, as well as the n-grams
and sample sentences, cost somewhat more. For the exact prices,
click on "MORE INFORMATION / FREQUENCY LISTS" at any of these
four corpora, or contact us.
Note 1: There are other sources for
frequency lists from the BNC (site 1, site 2). If, however, you want
customized frequency lists (by PoS, by genre, etc.), please feel free
to contact us.
Also, please be aware that the BNC wordlists are from a fifteen-year-old
corpus, and that they are probably not completely applicable to
American English (more
information: MORE INFORMATION / COMPARE WORDLISTS TO BNC).
Note 2: We can provide nearly any type
of data you want -- with one exception. Because of serious copyright
issues, we cannot re-distribute the corpora in any format that would
allow end users to re-create even one entire article from the
original texts. Again, feel free to contact us if you
have questions.