COCA / BNC: Academic

In the recent book Doing Linguistics with a Corpus (2020), Jesse Egbert, Tove Larsson, and Douglas Biber (hereafter ELB 2020) spend about 12% of their book (9 of its 73 pages; pages 6-15) comparing the COCA and BNC corpora. Jesse Egbert and Douglas Biber (this time with Bethany Gray) then followed that book with Designing and Evaluating Language Corpora (Cambridge Univ Press, 2022; hereafter EBG 2022), which discusses COCA throughout.

One of the main critiques of COCA in ELB (2020) and EBG (2022) is that there are serious problems with the academic texts in COCA, whereas the academic texts in the BNC are much more reliable. We will present the three main arguments that they make against COCA academic in ELB (2020) and then the two arguments that they make against COCA academic in EBG (2022). As we will see, the data does not support their arguments at all. If anything, the data shows that it is COCA that has the more reliable academic texts.

Egbert, Larsson, Biber (2020)

1. They argue (page 11, et passim) that linking adverbials (however, thus, therefore, etc) are a feature of academic English, but that these are not as frequent in COCA-Acad as in BNC-Acad. They therefore argue that COCA-Acad is somehow "defective".

Our response: ELB is not comparing "apples to apples", in terms of time period or dialect. Data from COHA (the Corpus of Historical American English) shows that these linking adverbials were much more common in American English generally 30-35 years ago (when the BNC was created) than they are now, and COCA shows a sharp decline from 1990-2019. In addition, in GloWbE (the Global Web-based English corpus) these adverbials are more common generally in British than American English. When we take into account the historical changes and the dialectal differences, these linking adverbials are actually more common in COCA-Acad than BNC-Acad.

2. They also look at nominalizations (*tion, *ism, *ence, etc), which are also a feature of academic English. The frequency in BNC-Acad is 35,613 tokens (per million words) vs 33,636 in COCA-Acad. Again, they argue that COCA-Acad is somehow "defective".

Our response: Is this difference of just 6% really significant? Also, nominalizations have decreased generally in American English in the 30-40 years since the 1980s, when the BNC was created (COHA, COCA). If that overall change in American English (not just in academic) is taken into account, nominalizations are actually more common in COCA-Acad than in BNC-Acad.
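The size of that gap can be checked directly from the two per-million figures quoted above. The following is only a minimal sketch; the function name relative_difference is ours, for illustration, and it shows the gap measured against either corpus as the baseline.

```python
def relative_difference(higher, lower, baseline):
    """Difference between two normalized frequencies, as a share of the baseline."""
    return (higher - lower) / baseline

bnc_pmw = 35_613   # nominalizations per million words in BNC-Acad (figure quoted above)
coca_pmw = 33_636  # nominalizations per million words in COCA-Acad (figure quoted above)

# BNC is about 5.9% higher than COCA when COCA is the baseline...
print(f"{relative_difference(bnc_pmw, coca_pmw, coca_pmw):.1%}")
# ...and COCA is about 5.6% lower than the BNC when the BNC is the baseline.
print(f"{relative_difference(bnc_pmw, coca_pmw, bnc_pmw):.1%}")
```

Either way, the difference rounds to about 6%.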

3. Perhaps the strangest of the three pieces of "evidence" for the supposed weakness of COCA is their argument that one word (intestine) is more common (per million words) in BNC-Acad than in COCA-Acad. They therefore argue that on lexical grounds, the BNC has better / more representative "academic" than COCA.

Our response: Frankly, it is silly to focus on just one word (intestine) and, based on that one word, to suggest that BNC-Acad is somehow better.

In order to look at this systematically, we took all 568 of the words in the Academic Word List (AWL; Coxhead, 2000), a "neutral" word list that would favor neither the BNC nor COCA, and compared the frequency of each of these words in both COCA and the BNC. The results are found in this spreadsheet (ZIP, TXT).

If we look at all 568 words in AWL, there are 310 words that are more frequent in COCA; in other words, the normalized frequency (per million words) is at least 1% more in COCA than the BNC. There are 254 words that are more frequent in the BNC; i.e. where COCA is at least 1% less frequent than the BNC.

(But a difference of 1% either way probably doesn't matter much. So let's limit it to words that are at least 50% more frequent in either COCA or the BNC (e.g. 12 tokens per million words in COCA, but 7 tokens pmw in the BNC), but which are "still in the ballpark" in terms of frequency (in other words, not more than 20x as frequent in one of the two corpora). In this case, there are 142 words (yellow) that are more frequent in COCA, and 97 words (blue) that are more frequent in the BNC.)
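The two thresholds just described (at least 50% more frequent, but no more than 20x) can be sketched as a simple classification rule. This is only an illustration of the logic, not the spreadsheet itself: the function name classify and the sample frequencies are hypothetical, and only the 12 vs 7 pair comes from the example above.

```python
def classify(coca_pmw, bnc_pmw, min_ratio=1.5, max_ratio=20.0):
    """Return 'COCA' or 'BNC' if the word is clearly more frequent in that corpus,
    or None if the two frequencies are too close or too far apart to compare."""
    if coca_pmw <= 0 or bnc_pmw <= 0:
        return None  # a word missing from one corpus has no meaningful ratio
    if min_ratio <= coca_pmw / bnc_pmw <= max_ratio:
        return "COCA"
    if min_ratio <= bnc_pmw / coca_pmw <= max_ratio:
        return "BNC"
    return None

# Hypothetical per-million frequencies (COCA, BNC), for illustration only.
samples = {
    "word_a": (12.0, 7.0),    # the example above: about 71% more frequent in COCA
    "word_b": (7.0, 12.0),    # the mirror case: more frequent in the BNC
    "word_c": (10.0, 10.05),  # within a fraction of a percent: not counted
    "word_d": (100.0, 2.0),   # 50x apart: outside the "ballpark", not counted
}

for word, (coca, bnc) in samples.items():
    print(word, classify(coca, bnc))
```

Setting min_ratio to 1.01 and dropping the upper limit would correspond to the looser "at least 1%" comparison in the previous paragraph.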

So using the criterion of word frequency, but going beyond the one word (intestine) that Egbert, Larsson, and Biber use to look systematically at all words in the AWL, the relative frequency of academic words is higher in COCA-Academic than in BNC-Academic.

Egbert, Biber, Gray (2022)

I provided detailed data to Jesse Egbert and Douglas Biber to show that in all three cases in ELB (2020) the truth is exactly the opposite of what they claim, and that COCA academic is actually more "academic" than BNC academic. I emailed the results of these studies to the authors, and they emailed back to say that they had received them. However, the authors chose not to include any of these three supposed pieces of “evidence” in their 2022 book. Apparently it was okay to include these tests when they supposedly favored the BNC, but they needed to be removed once it was shown that they favored COCA.

Rather than re-use any of these three tests in the 2022 book, the authors found two new tests that (they claimed) showed that BNC academic is “better” academic than COCA academic.

4. The first is "noun complements". In noun complements (for example, the claim [ that this research is valid ]), the head noun (e.g. claim) does not play a role in the subordinate clause. Compare this to a relative clause like the man [ that you saw ___ ], where the head noun does play a role in the subordinate clause (it fills the object gap after saw, in this case).

Our response: Unfortunately, the authors never provide a list of the head nouns that are involved in their BNC/COCA study of noun complements, and so it is impossible to confirm their claims regarding that construction.

5. The authors also look at the NOUN+NOUN construction (e.g. data collection, case study), which is a feature of academic English. They assert that this construction is more common in the BNC (pages 211-215).

Our response: Actual data from COCA and the BNC clearly disproves the claim made in the book. The following is the frequency of NOUN+NOUN in the academic portion of both COCA and the BNC. (Click on the links for a list of NOUN+NOUN in the two corpora. These lists show all NOUN+NOUN strings that occur five times or more in the 120 million word academic portion of COCA, and two times or more in the 15 million word academic portion of the BNC.)

        # words in academic    # NOUN + NOUN    NOUN + NOUN per million words
COCA    119,790,456            4,225,693        35,276
BNC     15,331,668             360,362          23,504

As can be seen, NOUN+NOUN is more than 50% more common (per million words) in COCA than in the BNC. Following their criteria, COCA has “better” academic.
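The normalization can be reproduced from the token counts and corpus sizes in the table above. The following is a minimal sketch; the function name per_million is ours, for illustration.

```python
def per_million(tokens, corpus_words):
    """Normalize a raw token count to a rate per million words."""
    return tokens / corpus_words * 1_000_000

# Counts and corpus sizes taken from the table above.
coca_rate = per_million(4_225_693, 119_790_456)
bnc_rate = per_million(360_362, 15_331_668)

print(round(coca_rate))                # NOUN+NOUN per million words, COCA academic
print(round(bnc_rate))                 # NOUN+NOUN per million words, BNC academic
print(round(coca_rate / bnc_rate, 2))  # how many times more frequent in COCA
```

This gives roughly 35,276 NOUN+NOUN tokens per million words in COCA and 23,504 in the BNC (per-word proportions of about 0.035 and 0.023), a ratio of just over 1.5.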

5a. Related data also shows that the lexis (words) in BNC academic seems to be quite “skewed” towards medical texts. For example, each of the following phrases is in the top 35 NOUN+NOUN sequences in the BNC: gall bladder, bowel disease, biopsy specimens, acid secretion, bile acid, and plasminogen activator. The academic genre should have texts from a wide range of domains, like science, law, philosophy, humanities, history, education, etc. (COCA is balanced between these domains.) Why the serious skewing towards the medical domain in the BNC? EBG (2022) focuses almost entirely on syntactic issues rather than lexical issues, so perhaps these seriously skewed texts escaped their notice.

And just to summarize, the following are the results of the five tests that the authors mention in the 2020 and 2022 books, in their comparison of BNC and COCA academic:

         Book         Phenomenon                          BNC vs COCA: which is more “academic”
1        2020         Linking adverbials                  They say BNC, but it is COCA when adjusted for time/dialect (#1)
2        2020         Nominalizations                     They say BNC, but it is COCA when adjusted for time/dialect (#2)
3, 5a    2020/2022    Lexical (including NOUN + NOUN)     Clearly COCA
4        2022         Noun complements                    Possibly BNC (but no clear explanation of data in book)
5        2022         NOUN + NOUN                         Clearly COCA

Based on the data that the authors themselves provide in their books, COCA academic is probably more representative of what we would expect from academic writing than BNC academic is.