Word (PoS) | COCA freq | BNC freq | collocate | COCA (click to see) | BNC (click to see) |
click (noun) | 3145 | 445 | adj | | |
nibble (verb) | 1194 | 244 | noun | | |
serenely (adv) | 308 | 83 | verb | | |
crumbled (adj) | 446 | 27 | noun | | 0 |
Syntax. Consider the following three examples.
[like] for [p*] to [v*] (I’d really like for you to stay)
There are 5 tokens in the BNC, but 352 tokens in COCA. With the BNC there aren't enough examples to see whether this is a feature of informal or formal English, but the data from COCA show that it is clearly a feature of spoken English. The data also show that it is increasing slowly over time, when compared as a ratio to the construction [like -- him to V].
Is it excel in V-ing, or excel at V-ing? (she excels in/at playing the piano)
Granted, this is a very narrow issue, but it is precisely the thing that translators and non-native speakers are interested in. With the BNC there are 5 tokens with at and 6 with in -- probably not enough to say which is more common. In COCA, however, there are 136 with at and 47 with in. This is enough to begin to see which genres prefer one or the other, as well as which subordinate clause verbs occur with each. Such granularity is not possible with the BNC.
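One rough way to see why 11 tokens are "probably not enough" while 183 tokens are is to put an interval around the share of at in each corpus. The sketch below uses the counts quoted above and a Wilson score interval. The choice of interval, the 95% level, and the names `wilson_interval` and `EXCEL_COUNTS` are illustrative assumptions; none of this is provided by the corpus interface.

```python
import math

# Token counts quoted in the text: (excel at V-ing, excel in V-ing)
EXCEL_COUNTS = {"BNC": (5, 6), "COCA": (136, 47)}


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


for corpus, (at_count, in_count) in EXCEL_COUNTS.items():
    total = at_count + in_count
    low, high = wilson_interval(at_count, total)
    print(f"{corpus}: at = {at_count}/{total} ({at_count / total:.0%}), "
          f"interval {low:.0%} to {high:.0%}")
```

With these counts the BNC interval straddles 50%, so neither preposition can be called more common, whereas the COCA interval lies entirely above 50%.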
[have] been being [vvn] (she had been being watched)
There are 2 tokens in the BNC (1 spoken, 1 fiction), and this is not enough data to see any possible genre variation. In COCA, on the other hand, there are 14 tokens (10 spoken, 2 fiction, 1 news, 1 academic). This is enough to show that this is a feature of spoken English, and the data also show that it has been increasing since 1990. (By the way, most native speakers of both dialects will cringe at sentences like this, but they are in the corpora.)
In summary, while 100 million words is often adequate for studying syntax, for some very low-frequency phenomena, there is a real difference between 100 million words (BNC) and 560 million words (COCA).
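The raw counts above come from corpora of different sizes, so it can help to put them on a common scale. The following sketch normalizes the quoted token counts to tokens per million words, assuming the round corpus sizes given in the summary (100 million for the BNC, 560 million for COCA). The dictionary and function names are illustrative only.

```python
# Corpus sizes in millions of words, as stated in the summary above.
CORPUS_SIZE_MILLIONS = {"BNC": 100, "COCA": 560}

# Token counts quoted in the three syntax examples.
TOKENS = {
    "[like] for [p*] to [v*]": {"BNC": 5, "COCA": 352},
    "excel at V-ing": {"BNC": 5, "COCA": 136},
    "excel in V-ing": {"BNC": 6, "COCA": 47},
    "[have] been being [vvn]": {"BNC": 2, "COCA": 14},
}


def per_million(tokens, corpus):
    """Return the frequency per million words for a raw token count."""
    return tokens / CORPUS_SIZE_MILLIONS[corpus]


for construction, counts in TOKENS.items():
    rates = ", ".join(
        f"{corpus} {per_million(n, corpus):.3f} per million ({n} tokens)"
        for corpus, n in counts.items()
    )
    print(f"{construction}: {rates}")
```

For [have] been being [vvn] the normalized rates are close (0.02 versus 0.025 per million), so what COCA adds there is mainly a larger absolute number of examples to subdivide by genre and year.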
How up-to-date are the corpora?
COCA has 20 million words in each year since the early 1990s (for a total of more than 520 million words), and the most recent texts are from December 2017. The BNC was created in the late 1980s and released in the early 1990s, and there was an update in 2014. This has important implications for how well the two corpora represent contemporary English.
Lexical. Perhaps the easiest comparison deals with words that have recently come into English, or which are used a lot more now than 20-25 years ago. The following lists show a few words (just a tiny sample of all such words) that are found less than half as often in the BNC as in COCA (per million words), and the words in italics are found less than 10% as often (often, there are no tokens in the BNC). Obviously, some are American words and wouldn't be in a corpus of British English. Many others, however, are simply much more common in COCA because it is more recent.
Noun: website (COCA/BNC), blog (COCA/BNC), globalization/globalisation (COCA/BNC), SUV, RPG, Taliban, e-mail, anthrax, recount, adolescent, prep, tsunami, affiliation, Sunni, insurgent, insurgency, terrorism, coping, terrorist, cleric, yoga, homeland, genome, steroid, detainee, militant
Adjective: same-sex (COCA/BNC), Islamist (COCA/BNC), upscale (COCA/BNC), terrorist, faith-based, web-based, nonstick, dot-com, performance-enhancing, high-stakes, 21st-century, old-school, pandemic, iconic, insurgent, online, broadband, gated, wireless, clueless
Adverb: wirelessly (COCA/BNC), healthfully (COCA/BNC), multiculturally (COCA/BNC), preemptively, inferiorly, counterintuitively, online, forensically, intraoperatively, postoperatively, famously
Verb: mentor (COCA/BNC), morph (COCA/BNC), download (COCA/BNC), e-mail, makeover, prep, upload, workout, freak, transition, vaccinate, encrypt, reconnect, click, host, splurge, preheat, co-write, outsource, snack, partner
Although we have focused just on new "words" here, the same thing holds for other areas of language -- morphology (word formation), syntax (grammar), semantics (word meaning, such as green = "environmentally friendly"), and discourse analysis (what we are saying about immigrants, or women, or the environment). Any changes that have occurred since the early 1990s may not show up in the BNC, but should be modeled quite nicely with COCA.
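The two thresholds used for the word lists (less than half as often, and less than 10% as often, per million words) can be sketched as a small classifier. This is only an illustration: the corpus sizes of 560 and 100 million words are the round figures used elsewhere on this page, the function names and category labels are hypothetical, and the example counts are the word frequencies from the table at the top of this section rather than counts for the listed words.

```python
COCA_MILLIONS = 560  # assumed round size of COCA
BNC_MILLIONS = 100   # assumed round size of the BNC


def bnc_to_coca_ratio(coca_tokens, bnc_tokens):
    """Ratio of the BNC per-million rate to the COCA per-million rate."""
    coca_rate = coca_tokens / COCA_MILLIONS
    bnc_rate = bnc_tokens / BNC_MILLIONS
    return bnc_rate / coca_rate


def classify(coca_tokens, bnc_tokens):
    """Label a word by how much rarer it is in the BNC than in COCA."""
    if coca_tokens == 0:
        return "not in COCA"
    ratio = bnc_to_coca_ratio(coca_tokens, bnc_tokens)
    if ratio < 0.10:
        return "under 10%"
    if ratio < 0.50:
        return "under half"
    return "comparable"


# (COCA tokens, BNC tokens) from the frequency table at the top of the section
EXAMPLES = {
    "click (noun)": (3145, 445),
    "nibble (verb)": (1194, 244),
    "serenely (adv)": (308, 83),
    "crumbled (adj)": (446, 27),
}

for word, (coca, bnc) in EXAMPLES.items():
    print(f"{word}: {classify(coca, bnc)}")
```

A word with no BNC tokens at all has a ratio of zero and so falls in the "under 10%" group, which matches the remark that the italicized words often have no tokens in the BNC.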
Genre balance
The BNC is 10% spoken / 90% written, while in COCA the corpus is nearly evenly divided (20% in each genre) between spoken, fiction, popular magazines, newspaper, and academic.
GENRE | COCA (millions of words) | BNC (millions of words) |
Spoken | 118 | 10 |
Fiction | 113 | 17 |
Popular magazines | 118 | 16 |
Newspaper | 114 | 11 |
Academic | 112 | 16 |
Other | | 30 |
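The percentages quoted above can be checked against the table with a few lines of arithmetic. The sketch below computes each genre's share of its corpus from the word counts in the table, treating the blank COCA cell for "Other" as zero (an assumption); the names `GENRE_MILLIONS` and `genre_shares` are illustrative.

```python
# Millions of words per genre, copied from the table above.
# Assumption: COCA has no "Other" category, so its blank cell is treated as 0.
GENRE_MILLIONS = {
    "Spoken": {"COCA": 118, "BNC": 10},
    "Fiction": {"COCA": 113, "BNC": 17},
    "Popular magazines": {"COCA": 118, "BNC": 16},
    "Newspaper": {"COCA": 114, "BNC": 11},
    "Academic": {"COCA": 112, "BNC": 16},
    "Other": {"COCA": 0, "BNC": 30},
}


def genre_shares(corpus):
    """Return each genre's share of the corpus total, based on the table."""
    total = sum(row[corpus] for row in GENRE_MILLIONS.values())
    return {genre: row[corpus] / total for genre, row in GENRE_MILLIONS.items()}


for corpus in ("COCA", "BNC"):
    shares = genre_shares(corpus)
    summary = ", ".join(f"{genre} {share:.1%}" for genre, share in shares.items())
    print(f"{corpus}: {summary}")
```

Shares are computed relative to the sum of the table's own column, so they reflect the rounded figures shown here rather than exact corpus sizes.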
The BNC has a much wider range of spoken sub-genres, while the spoken portion of COCA is composed of unscripted conversation on TV and radio shows (see the notes on the naturalness of these conversations: COCA / Help-Information / Texts / Spoken).
Both corpora are very well balanced in terms of sub-genres for the written genres (e.g. Newspaper-Sports, or Academic-Medicine). In addition, because there is a diachronic aspect to COCA (coverage over time), the distribution of 20% in each of the five genres stays constant from year to year.
Summary
COCA and the BNC complement each other nicely, and they are the only large, well-balanced corpora of English that are publicly available. The BNC has better coverage of informal, everyday conversation, while COCA is much larger and more recent, which has important implications for the quantity and quality of the data overall.
Unless one is inherently interested in only British or American English, there is really no reason to not take advantage of both corpora. This is especially true when -- as with the interface at www.english-corpora.org -- both corpora can be used side-by-side, with the same interface. For most types of studies, academic publications and presentations that rely on just the BNC for data from Modern English will look increasingly outdated and insular as time goes on.