English-Corpora.org


Note: this page was shared with Egbert, Larsson, and Biber in early March 2022, and they are aware of its content. Although they have reviewed this page (according to the email that I received back from them), they did not request any corrections to the points made here. Also, it is likely that an upcoming book by Egbert, Biber, and Gray (Cambridge, 2022) will repeat some of the same arguments that are addressed here, in which case that book would share the same flaws.

In the recent book Doing Linguistics with a Corpus (2020), Egbert, Larsson, and Biber (ELB 2020) spend about 12% of their book (9 of its 73 pages; pages 6-15) comparing COCA and the BNC, as a case study in "determining textual composition of available corpora". This page is a response to some of the issues and questions raised therein.

There are two main issues they raise: 1) we don't really know what is in COCA or how it was created, and 2) COCA-Academic doesn't look as "academic" as BNC-Academic. I will discuss each of these two issues below.


The first set of concerns for the authors is that -- compared to the BNC -- COCA doesn't provide much data on how/why the (academic portion of the) corpus was designed/created in the way it was. The following are their concerns (stated in red):

1. "it would be very useful to review the contents of the texts themselves; however, the online version of COCA does not allow us to do so" (page 8). Our response:

  • The main page of COCA has a prominent link to downloadable COCA data from www.corpusdata.org. At that site, users can download each and every one of the 485,179 texts in COCA. (The free samples alone contain millions of words of data.)

  • Any search in COCA can lead to an "expanded context" page that provides about 300 words of context. So even via the web interface, it is possible to reconstruct the data for a small number of texts that someone might feel are "problematic".

  • Bottom line -- anyone who wants to see exactly what is in COCA can do so.

2. After noting that the BNC files include metadata in the text files themselves, they state that "The COCA_AC files contain no additional metadata. Each COCA_AC text file begins with a text ID that links them to the information we reviewed earlier" (page 10). Our response:

  • All of the metadata for COCA is easily available in a detailed spreadsheet that is prominently linked from the home page of the corpus, and it has been that way since COCA was first released in 2008.

  • In the actual text files themselves (available from corpusdata.org), each text has a textID, which links it to the rich [sources] metadata.

  • But yes, users need to link the [textID] in the text files (for example, textID = 72839) with the [textID] in the metadata (again, textID = 72839). But why is this a problem?

  • Linked data like this is the way that most large relational databases work (in part because it provides much better speed). But with a simple one-line JOIN statement in SQL, all of the data can easily be brought together in one place, if users really need it to be (see the sketch after this list).

  • Including metadata directly in the text files was more common back in the 1990s, when corpora were a small fraction of the size of modern corpora. But that approach just doesn't work now that we have corpora with billions of words.

  • Bottom line -- it shouldn't be an issue that the texts and metadata are stored separately (but can be easily joined together, when needed). That's how things often work in large, modern corpora.
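As a minimal illustration of the point about joining texts and metadata, here is a sketch using Python's built-in sqlite3 module. The table and column names (texts, sources, textID, genre, year) are hypothetical stand-ins for whatever schema a user sets up after downloading the data from corpusdata.org; the point is simply that a single JOIN puts each text and its metadata side by side.

```python
import sqlite3

# Hypothetical schema: a [texts] table holding the downloaded text files and a
# [sources] table holding the metadata spreadsheet, both keyed on textID.
conn = sqlite3.connect("coca.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS texts   (textID INTEGER PRIMARY KEY, words TEXT);
    CREATE TABLE IF NOT EXISTS sources (textID INTEGER PRIMARY KEY, genre TEXT,
                                        year INTEGER, source TEXT);
""")

# A single JOIN brings each text together with its metadata.
rows = conn.execute("""
    SELECT t.textID, s.genre, s.year, s.source, t.words
    FROM texts AS t
    JOIN sources AS s ON s.textID = t.textID
    WHERE s.genre = 'ACAD'
""").fetchall()
```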

3. The authors note that in the BNC, each text was "carefully reviewed", whereas that is not the case for COCA (page 13). Our response:

  • The BNC has 501 texts in BNC-Acad (the academic part of the BNC, which is what they are interested in)

  • COCA has 26,137 texts in COCA-Acad (the academic part of COCA, which is what they are interested in)

  • We are not aware of any large corpus (with tens of thousands of texts in academic, for example) where each and every text has been carefully examined at a qualitative level.

  • We can either 1) stay with relatively small corpora (like the BNC) where each and every text has been carefully examined, or 2) create corpora like COCA and the other corpora from English-Corpora.org (or the corpora from Sketch Engine), which contain billions of words of data, but where each text is not manually examined qualitatively.

  • We unapologetically opt for #2, since it provides much richer data -- especially for lower- and medium-frequency syntactic constructions (where the BNC might have just a handful of tokens), or for words (where the collocational data from a small corpus like the BNC is often very meager) -- and because there is little or no evidence that a "99.9% pure" corpus yields significantly different data than a "100% pure" corpus (as if such a thing could exist).

4. The authors find two texts (among the 26,137 academic texts in COCA) which, to them, do not look like peer-reviewed articles (pages 11-12). Our response:

  • Look at the metadata and google these two articles. They are in fact peer-reviewed. Problem solved.

  • But even if the articles weren't peer-reviewed (which they are), are just 2 out of 26,137 articles really significant? Or systematic?

  • To see if there are any systematic problems, try the following experiment:

  • Take 1% of the articles in COCA-Acad (about 261 articles = 1/100 of 26,137) absolutely at random -- for example, all texts with [textID] ending in [14] or [49], or any other set of randomly selected texts (see the sketch after this list)

  • This is a small number (just 261 texts) -- small enough that you can go through them one by one and see how many are not peer-reviewed. If there are more than 2 or 3 (about 1% of the sample), send us an email (mark.davies@english-corpora.org). But there won't be.

  • Bottom line -- "there is no there there". COCA-Academic articles are peer-reviewed.
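A minimal sketch of how such a sample could be drawn, assuming the COCA-Acad metadata has been exported to a tab-delimited file; the filename and column names (textID, genre, source) are hypothetical here:

```python
import csv
import random

# Assumes the COCA [sources] metadata has been exported to a tab-delimited file
# (hypothetical filename and columns).
with open("coca_sources.txt", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

acad = [r for r in rows if r["genre"] == "ACAD"]

# Option 1: every text whose textID ends in 14 or 49 (a fixed slice keyed to
# the last two digits of the textID).
sample = [r for r in acad if int(r["textID"]) % 100 in (14, 49)]

# Option 2: a random 1% sample (about 261 of the 26,137 COCA-Acad texts).
sample = random.sample(acad, k=max(1, len(acad) // 100))

for r in sample:
    print(r["textID"], r["source"])
```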

5. "it is difficult to be certain because the methods used to compile COCA-AC are not described in detail" (page 13)
     "we also encourage corpus creators to be more transparent in documenting their corpora. In the case of COCA-AC, we get ample detail about the sources for the texts, but we know little about the methods that were used to determine that those were appropriate for the corpus" (page 14). Our response:

  • A number of articles available online do discuss these methods, although perhaps not all in one single place

  • In defense of COCA, in the 12 years after it was released by Mark Davies, he also released [1] TIME Magazine Corpus (2009), [2] Corpus of Historical American English (COHA) (2010), [3] Corpus of American Soap Operas (2012), [4] Global Web-Based English (GloWbE) (2013), [5] Wikipedia Corpus (2014), [6] CORE Corpus (2015), [7] Corpus of US Supreme Court Opinions (2015), [8] News on the Web (NOW) (2016), [9] iWeb: The Intelligent Web-based Corpus (2017), [10] The TV Corpus (2019), [11] The Movie Corpus (2019), [12] Coronavirus Corpus (2020); architectures and interfaces for [13] Strathy Corpus (Canada) (2012), [14] Hansard Corpus (2015), [15] Early English Books Online (2015); as well as non-English corpora like [16] Corpus del Español (Web-Dialects) (2016), [17] Corpus do Português (Web-Dialects) (2016), [18] Corpus del Español (NOW) (2016), [19] Corpus do Português (NOW) (2016).
    Nineteen corpora in 12 years, and most of them created by just one person. And almost every one of these corpora is larger than the BNC, and some are 10-15 billion words in size. Together, these corpora have been used as the basis for thousands of articles and books, and they are the most widely-used corpora in existence. I have also created resources like the Academic Vocabulary List, full-text corpus data, n-grams, collocates, and the most carefully-corrected large frequency lists of English -- and these resources have been used by hundreds of companies and thousands of universities. Finally, I have been the sole administrator for all of these corpora -- which involves hundreds of emails each month to answer questions from users. And all of this is in addition to research and teaching (5-6 classes each year). So yes, I've been rather busy.

  • In the 12 years after the BNC was released, the BNC team (several people) also released . . . well actually, they didn't release any other corpora in the 12 years from 1993-2005.

  • So our apologies if we weren't able to find the time to write up nice, long qualitative overviews of COCA, in the same way as the BNC people -- who certainly had much more leisure time on their hands.

  • Also, if you have questions about COCA -- just ask!! I answer all emails (mark@mark-davies.org, or mark.davies@english-corpora.org, or mark.davies@corpusdata.org) within 24 hours, usually within a couple of hours. There is no need to suggest that I'm trying to hide anything about the creation of COCA. Just ask -- before you publish an article stating that we don't have much information about how COCA was created.

  • Finally, the criticisms about COCA could equally be made of other corpora -- like the Longman Corpus, which is used for the Longman Grammar of Spoken and Written English (Biber et al. 1999). Which article describes in detail how this corpus was constructed (in terms of qualitative criteria, at the level of individual texts)? And is this corpus publicly available, so that users can examine each of the texts qualitatively (as they can with COCA)?

  • Also, are there any publicly-available corpora from either Egbert or Larsson, which we can also examine in detail, to see practical examples of how corpora should be created? (And not the CORE Corpus, which some articles by Egbert and Biber suggest was created by them, but which was in fact created by Mark Davies.) It's nice to write books about designing and creating corpora. It's even better to actually create one.


The second main argument is that COCA-Academic (COCA-Acad) isn't really as "academic" as BNC-Academic (BNC-Acad). There are three pieces of supposed evidence they use:

1. "Linking adverbials" (however, thus, therefore, etc) are a feature of academic English, but (according to ELB 2020) these aren't as frequent in COCA-Acad as in BNC-Acad (page 11).
     They suggest that these adverbials are about 41% more frequent in BNC-Acad than in COCA-Acad (2,984 tokens per million words in the BNC vs 2,121 tokens per million words in COCA)
     Our response:

  • These adverbials are more frequent in the BNC, but -- as we will see -- this has little if anything to do with genre

  • The authors are comparing "apples to oranges" -- a corpus from 30 years ago (the BNC) to a much more recent corpus (COCA)

  • There is a 37% decrease in linking adverbials in COCA (for all genres) from the early 1990s to the early 2010s (the search is for any punctuation followed by one of these nine adverbials)

  • In COHA, there is a 65% decrease with these adverbials -- again, across all genres -- from the 1980s (the decade for many of the texts in the BNC) to the 2000s

  • So if linking adverbials are roughly 40% more frequent in the 1980s/early-1990s BNC than in COCA-Acad as a whole (which runs through 2019), much of this gap can be accounted for by an overall decrease in these adverbials in American English

  • In addition to historical change, there is a dialectal difference between British and American English. These adverbials are about 13% more common in the UK (GB) than in the US -- in GloWbE (based on web pages).

  • In summary, these adverbials are about 40% more common in 1980s/early-1990s British English (BNC) than in COCA (US, 1990-2019). But historical change accounts for at least a 36% difference, and dialect differences for another 13%.

  • If you take these two factors into account (historical change and dialectal differences), one could make the argument that linking adverbials are actually more common in COCA-Acad than in BNC-Acad (a rough worked example of this arithmetic follows this list).
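To make the arithmetic concrete, here is a rough back-of-the-envelope calculation using the figures cited above. The way the two adjustments are combined (simply multiplying them) is our own simplification for illustration, not a claim from ELB 2020:

```python
# Figures from the discussion above: tokens per million words (pmw)
bnc_acad_pmw  = 2984   # BNC-Acad (texts mostly from the 1980s / early 1990s)
coca_acad_pmw = 2121   # COCA-Acad (1990-2019)

historical_decrease = 0.36   # decrease in linking adverbials in AmE since the BNC era
dialect_ratio       = 1.13   # GB roughly 13% higher than US (GloWbE)

# One rough way to combine the adjustments: project the BNC figure forward in
# time and across the dialect difference, then compare it with COCA.
bnc_adjusted = bnc_acad_pmw * (1 - historical_decrease) / dialect_ratio
print(round(bnc_adjusted))           # about 1690 pmw
print(bnc_adjusted < coca_acad_pmw)  # True: adjusted BNC rate falls below COCA's 2,121
```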

2. They also look at nominalizations (*tion, *ism, *ence, etc). They actually find that COCA-Acad does look a lot like the BNC, in terms of frequency -- 35,613 tokens (per million words) in the BNC vs 33,636 in COCA -- a difference of (just) 6% (page 11).
     Our response:

  • Surprisingly, there is a decrease overall in nominalizations (at least in American English) from the 1980s/early 1990s (when the BNC was created) through the 2010s (when COCA ends), as seen below

  • Looking just at academic in COCA, for example, there is a 7% decrease with *tion nouns from 1990-94 to 2010-2014

  • Looking at all genres in COCA, there is a 12% decrease with *tion nouns during the same period

  • Looking at all genres in COHA, there is a 14% decrease with *tion nouns from the 1980s (the date of many texts in the BNC) to the 2000s

  • So again, the authors are confusing genre differences with historical change. Nominalizations decreased anywhere from 7-14% from the time of the BNC to the end of COCA. That alone accounts for more than the 6% difference they see between the BNC (a corpus that is now 30+ years old) and the much more recent COCA.

  • In fact, taking historical change into account, it is entirely possible that nominalizations are actually more common in COCA-Acad than in BNC-Acad (see the quick calculation below).
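Again, purely as an illustration of the arithmetic (applying the historical decrease directly to the BNC figure is our own simplification):

```python
bnc_nom_pmw  = 35613   # nominalizations per million words, BNC-Acad
coca_nom_pmw = 33636   # nominalizations per million words, COCA-Acad (a 6% gap)

# Apply the 7-14% decrease in nominalizations since the BNC era to the BNC figure.
for decrease in (0.07, 0.14):
    adjusted = bnc_nom_pmw * (1 - decrease)
    print(decrease, round(adjusted), adjusted < coca_nom_pmw)
# Even the low end of the 7-14% range pushes the adjusted BNC rate below COCA's 33,636.
```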

3. Perhaps the strangest of the three pieces of "evidence" that the academic texts in the BNC are "more academic" than those in COCA is that one word -- intestine -- this one single word -- is more common (per million words) in the BNC than in COCA (pages 12-13).
     Our response:

  • Why focus on this one word -- intestine? How many other words did the authors look at (which were more common in COCA academic) before they came upon this one word that is more common in the BNC? Certainly we can be more scientific and systematic than that.

  • In order to look at this systematically, we took all of the words in the Academic Word List (AWL; Coxhead, 2000) -- a "neutral" word list -- and compared the frequency of each of these words in COCA and the BNC.

  • The results are found in this spreadsheet (ZIP, TXT)

  • If we look at all words, there are 310 words that are more frequent in COCA (in other words, where the normalized frequency (per million words) is at least 1% higher in COCA than in the BNC).

  • There are 254 words that are more frequent in the BNC (i.e. where the word is at least 1% less frequent in COCA than in the BNC)

  • But a difference of 1% either way probably doesn't matter much. So let's limit it to words that are at least 50% more frequent in either COCA or the BNC (e.g. 12 tokens per million words in COCA, but 7 tokens pmw in the BNC), but which are "still in the ballpark" in terms of frequency (in other words, not more than 20x as frequent in one of the two corpora; a small sketch of this filtering follows this list). In this case:

  • As the spreadsheet above shows, there are 142 words (yellow) that are more frequent in COCA

  • There are 97 words (blue) that are more frequent in the BNC

  • So using the criterion of word frequency -- but going beyond the one word (intestine) that the authors use, and looking at all of the words in the AWL -- academic words are more frequent in COCA than in the BNC, not the other way around.
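A minimal sketch of the filtering described above, assuming the comparison spreadsheet has been exported to a tab-delimited file with per-million-word frequencies for each AWL word in the two corpora (the filename and column names here are hypothetical):

```python
import csv

more_coca, more_bnc = [], []

# Hypothetical export of the comparison spreadsheet: one row per AWL word,
# with normalized (per million word) frequencies in COCA and the BNC.
with open("awl_coca_bnc.txt", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        coca = float(row["coca_pmw"])
        bnc  = float(row["bnc_pmw"])
        if coca == 0 or bnc == 0:
            continue
        ratio = coca / bnc
        # Keep only words that are at least 50% more frequent in one corpus,
        # but still "in the ballpark" (not more than 20x as frequent).
        if 1.5 <= ratio <= 20:
            more_coca.append(row["word"])
        elif 1/20 <= ratio <= 1/1.5:
            more_bnc.append(row["word"])

print(len(more_coca), "words more frequent in COCA")    # 142 in our spreadsheet
print(len(more_bnc), "words more frequent in the BNC")  # 97 in our spreadsheet
```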

But the fact of the matter is that both the BNC and COCA have very good academic sections, so it really doesn't make sense to argue that one is somehow "better" than the other. What is certain, though, is that COCA is not inferior to the BNC, at least by the three criteria that the authors have chosen to use.


One final issue concerns COCA and the BNC, in more general terms:

  • There are other corpora besides the BNC. The BNC is a great corpus, and it has served as the inspiration for many other corpora, including COCA. But after 30 years, isn't it time for even BNC-philes to admit that the BNC no longer "has a corner on the market", and that other corpora are at least as useful for researchers and language learners? The "BNC-only" attitude was extremely prevalent when I got into English corpus linguistics in 2003, and I thought that it would subside with time. But in some corners, it's just as prevalent now (almost 20 years later) as it was then.

  • Practical considerations. This is particularly important when one considers that the BNC was in essence a "one-off" corpus -- lots of money and lots of researchers working on the corpus, for a very long time. This is not the way things work for 99% of corpus creators. (See page 76 here; it was true when that article was published in 2011, and it's just as true today.) When people speak from ivory towers about how a corpus has to be "practically perfect in every way", and how corpora aren't "real corpora" unless they look and act just like the BNC, they inhibit innovation and creativity from people who are using different approaches. It would be very sad if in 2030 or 2040 we were still expecting corpora to look just like the BNC -- from 40-50 years earlier.

  • It doesn't have to cost an arm and a leg. To give another example -- in a recent article (IJCL, 2021) we discuss the 525 million word TV and Movies corpora (part of which are from the US and are in COCA). In addition to being about 25 times as big as the spoken part of the BNC (including the 2014 update), the TV and Movies corpora actually have language that looks very similar to the spoken texts in the BNC. There was incredible pressure on the IJCL from people associated with the BNC not to publish this article (or at least to water it down significantly), since the article suggested that maybe we didn't need huge research teams and years of effort after all to create corpora that provide great insight into informal language (after all, it took one person about $0 and 4-5 weeks to create the TV and Movies corpora). The same thing happened with our first article on COCA -- lots of pressure from the BNC crowd not to publish that (since the first version of COCA in 2008 also cost about $0 to create, and was created by one person in 4-5 months). But kudos to editor Michaela Mahlberg for ignoring them, and to other editors since then. Bottom line -- people who are making lots of money and have cushy multi-year positions to create corpora want to convince others that it has to be that way, and that any corpora that don't follow their way of doing things are probably "defective" in some way. But this just isn't true.

  • A corpus is much more than just the texts. We haven't even addressed the question of how a corpus is so much more than just the texts. Some of the best "textual" corpora in the world (i.e. great representativeness) have architectures and interfaces that make them almost unusable. The corpora from English-Corpora.org (including COCA, of course) offer such a wide range of functionality -- much more than the strictly grammatically-oriented searches of the 1980s and 1990s (more modals, anyone?). Why not include these useful search features as part of the calculation when discussing the question: "what is a useful corpus"?

  • And finally, "the proof of the pudding is in the eating". COCA has been purchased and used by many hundreds of companies, including Amazon, Facebook, Google, Microsoft, IBM, Sony, Disney, Intel, Adobe, and Samsung; by language-related companies like Merriam-Webster, Dictionary.com, Grammarly, Duolingo, TurnItIn, Oxford University Press, and Sketch Engine; and by many more. Would these companies have purchased and used COCA if it weren't an incredibly useful research tool? COCA online is arguably the most widely-used corpus (and while of course there are multiple versions of the BNC online, at English-Corpora.org COCA is used 5-6 times as much as the BNC). Finally, COCA has served as the basis for 3,500+ publications since 2010 (most of them from the last 5-6 years). People use COCA so much simply because it provides such good data for such a wide range of researchers, teachers, and students. That is the bottom line in terms of "doing linguistics with a corpus".