| | Corpus of Contemporary American English | American National Corpus (note 2) |
| Size | 560+ million words (note 1) | 22 million words |
| Dates | 1990 - 2017 | 1990 - ?? (note 3) |
| Date distribution | 20 million words each year | 0.5-3 million |
| Updated | Yes, every year or two | No (??) (note 4) |
| Availability / price | Free access (but only via web interface) | Free (via Open ANC), or DVD ($75, from the LDC). Full text access. |
| Spoken | 118 million words (4m each year, 1990-2017) | 4 million words |
| Fiction | 113 million words (4m each year, 1990-2017) | 0.5 million words |
| Magazines | 118 million words (4m each year, 1990-2017) | 5 million words |
| Newspapers | 114 million words (4m each year, 1990-2017) | 4 million words |
| Academic | 112 million words (4m each year, 1990-2017) | 4 million words |
| Other text types | | 3 million words: Blog (Buffy the Vampire Slayer) |
Notes
1 The Corpus of Contemporary American English contained about 365 million words when it was released in early 2008 (20 million words each year, 1990-2007). As of Dec 2017, it has more than 560 million words. It will continue to grow by 20 million words each year.
2 Refers to the Second Release (2005) of the American National Corpus. There has not been a Third Release since that time.
3 This is probably a function of whether/when the ANC is completed.
4 The ANC was projected to have 100 million words upon completion in c. 2005. No plans have been announced to expand the corpus beyond that size, if/when the corpus is completed.
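To make the compositional difference in the table concrete, the genre figures can be turned into percentage shares. The following is only a minimal sketch: the dictionary and function names are illustrative, the numbers are copied from the table, and assigning the "Other text types" row to the ANC is an assumption.

```python
# Genre sizes in millions of words, copied from the comparison table.
COCA = {"spoken": 118, "fiction": 113, "magazines": 118,
        "newspapers": 114, "academic": 112}
ANC = {"spoken": 4, "fiction": 0.5, "magazines": 5,
       "newspapers": 4, "academic": 4,
       "other": 3}  # assumption: the "Other text types" row belongs to the ANC


def genre_shares(sizes):
    """Return each genre's share of the summed genre sizes, as a percentage."""
    total = sum(sizes.values())
    return {genre: 100 * words / total for genre, words in sizes.items()}


if __name__ == "__main__":
    for name, sizes in (("COCA", COCA), ("ANC", ANC)):
        print(f"{name}: {sum(sizes.values())} million words across listed genres")
        for genre, share in genre_shares(sizes).items():
            print(f"  {genre:<10} {share:5.1f}%")
```

The shares are computed against the sum of the listed genre figures, which need not match the headline corpus sizes exactly.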
With such a difference in the composition of the two corpora, it is not surprising that they yield very different data. In this section, we compare the 5,000-10,000 most frequent words in the two corpora, but similar comparisons could be made for syntax, semantics, etc.
+COCA / -ANC
About 20-25% of the words in the top 5,000 COCA word list are not in the corresponding ANC list. More precisely, each of these lemmas is at least twice as far down the ANC list as it is in the COCA list (e.g. COCA #4000, ANC #8000 or lower). Things get much messier at lower frequency levels, where the ANC lists are missing 50-60% of the words in the COCA lists.
The following words are examples. They are in the top 3,000-4,000 words in COCA, but (in this case) they are at least four times farther down the ANC list (for example, #2000 in COCA, #9000 in the ANC). As one can see, many of them are "everyday" words:
Adjectives: left, far, concerned, involved, supposed, Christian, growing, clean, alone, married, Catholic, English, used, surprised, spiritual, existing, living, fun, remaining, leading
Nouns: university, back, data, American, Republican, congress, south, east, Democrat, troop, institute, Christmas, learning, sir, fat, Jew, e-mail, academy, Indian, navy, teen, pine, Muslim, Olympics, handle
Verbs: need, stand, thank, lay, laugh, shake, smile, stare, drink, lift, grab, lean, nod, stir, dance, bend, slide, kiss, whisper, glance, pray, wave, bake, pause, shrug, cope, brush, sigh, excuse, hurry, burst, spill, hug, blend
(Note that many of these words come from fiction and from "popular magazines". They occur very infrequently in the ANC, since the ANC has essentially no texts from fiction or popular magazines. COCA, on the other hand, has 230+ million words from these genres.)
+ANC / -COCA
On the other hand, about 20-25% of the words in the ANC top 5,000 list are not in the COCA list, and things are much messier for lower-frequency words. The following words are in the top 5,000 of the ANC list but are at least four times farther down the COCA list (e.g. ANC #2000, COCA #9000). As one can see, they either reflect errors in the ANC (wrong part of speech or lemma) or are a function of the skewed text composition of the ANC (apparently, lots of academic journal articles on DNA sequencing):
Adjectives: uh-huh, um-hum, binding, e-mail, amino, conserved, mutant, genomic, molecular, incubated, viral, wild-type, purified, bye-bye, cultured, locus, correlated, putative, phylogenetic, endogenous, cytoplasmic, downstream, mammalian, catalytic, sequenced, transfected, recombinant, transgenic, terminus, gene-expression, eukaryotic
Nouns: yeah, um, cell, gene, datum, protein, sequence, gonna, tissue, acid, receptor, genome, mutation, tumor, huh, www, probe, cdna, mhm, mrna, clone, assay, membrane, activation, transcription, chromosome
Verbs: accord, detect, induce, calculate, isolate, label, activate, usee, controll, bind, stain, clone, cluster, inhibit, code, underlie, rang, amplify, overlap, school, sequence, encode, splice
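The rank-ratio criterion used in both comparisons (a word counts as under-represented if it sits at least two, or four, times farther down the other corpus's list) can be sketched roughly as follows. This is not the procedure actually used to build the lists above: the function names are hypothetical, the inputs are assumed to be plain frequency-ordered lists of lemmas, and treating a lemma that is absent from the other list as under-represented is an assumption.

```python
def rank_map(lemmas):
    """Map each lemma to its 1-based rank in a frequency-ordered list."""
    return {lemma: i for i, lemma in enumerate(lemmas, start=1)}


def underrepresented(source, target, top_n=5000, ratio=2.0):
    """Return (lemma, source_rank, target_rank) for lemmas in the top_n of
    `source` whose rank in `target` is at least `ratio` times farther down.

    Assumption: lemmas missing from `target` are also flagged, with a
    target_rank of None.
    """
    target_rank = rank_map(target)
    flagged = []
    for rank, lemma in enumerate(source[:top_n], start=1):
        other = target_rank.get(lemma)  # None if the lemma is not in target
        if other is None or other >= ratio * rank:
            flagged.append((lemma, rank, other))
    return flagged


if __name__ == "__main__":
    # Toy, made-up frequency lists (most frequent first), for illustration only.
    coca_toy = ["the", "be", "smile", "gene"]
    anc_toy = ["the", "be", "gene", "cell", "protein", "smile"]

    print("+COCA / -ANC:", underrepresented(coca_toy, anc_toy, top_n=4))
    print("+ANC / -COCA:", underrepresented(anc_toy, coca_toy, top_n=6))
```

Calling the function with `ratio=4.0` corresponds to the stricter "four times farther down" threshold used for the example word lists.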