COCA / BNC: Academic
In their book
Doing Linguistics with a Corpus (2020), Jesse Egbert, Tove Larsson, and
Douglas Biber (hereafter ELB 2020) spend about 12% of the
book (9 of its 73 pages; pages 6-15) comparing the COCA and BNC
corpora. Egbert and Biber (this time with Bethany Gray)
followed that book with
Designing and Evaluating Language Corpora (Cambridge University Press,
2022; hereafter EBG 2022), which discusses COCA throughout.
One of the main critiques of COCA in ELB (2020) and EBG (2022) is that there are serious problems with
the academic texts in COCA, whereas the academic texts in the BNC
are much more reliable. We will present the three main arguments
that they make against COCA academic in ELB (2020) and then the two
arguments that they make against COCA academic in EBG (2022). As we
will see, the data does not support their arguments at all. If
anything, the data shows that it is COCA that has the more reliable
academic texts.
Egbert, Larsson, Biber (2020)
1. They argue (page 11, et
passim) that linking adverbials (however, thus, therefore,
etc) are a feature of academic English, but that these are not as
frequent in COCA-Acad as in BNC-Acad. They therefore argue that
COCA-Acad is somehow "defective".
Our response: ELB is not
comparing "apples to apples", in terms of time period or dialect.
Data from COHA (the Corpus of Historical American
English) shows that these linking adverbials were much more common
in American English generally 30-35 years ago (when the BNC was
created) than they are now, and
COCA shows a sharp decline from 1990-2019. In addition, in
GloWbE (the Global Web-based English corpus) these adverbials are
more common generally in British than American English. When we
take into account the historical changes and the dialectal
differences, these linking adverbials are actually more common in
COCA-Acad than BNC-Acad.
2. They also look at
nominalizations (*tion, *ism, *ence, etc), which are also a
feature of academic English. The frequency in BNC-Acad is 35,613
tokens (per million words) vs 33,636 in COCA-Acad. Again, they argue
that COCA-Acad is somehow "defective".
Our response: Is this difference
of just 6% really significant? Also, nominalizations have decreased
generally in American English in the 30-40 years since the 1980s,
when the BNC was created (COHA,
COCA). If that overall change in American English (not just in
academic) is taken into account, nominalizations are actually more
common in COCA-Acad than in BNC-Acad.
3. Perhaps the strangest of the
three pieces of "evidence" for the supposed weakness of COCA is
their argument that one word (intestine) is
more common (per million words) in BNC-Acad than in COCA-Acad. They
therefore argue that, on lexical grounds, the BNC has better / more
representative "academic" than COCA.
Our response: Frankly, it is
silly to focus on just one word (intestine) and, based on
that one word, to suggest that BNC-Acad is somehow better.
To look at this systematically, we took all 568 of the words in the
Academic Word List (AWL; Coxhead, 2000), a "neutral" word list
that would favor neither the BNC nor COCA, and compared the
normalized frequency of each of these words in both COCA and
the BNC. The results are found in
this spreadsheet (ZIP,
TXT).
If we look at all 568 words in the AWL,
there are 310 words that are more frequent in COCA; in other words,
the normalized frequency (per million words) is at least 1% higher in
COCA than in the BNC. There are 254 words that are more frequent in the
BNC, i.e. where the COCA frequency is at least 1% lower than the BNC frequency.
(But a difference of 1% either way
probably doesn't matter much. So let's limit it to words that are at
least 50% more frequent in either COCA or the BNC (e.g. 12 tokens
per million words in COCA, but 7 tokens pmw in the BNC), but which
are "still in the ballpark" in terms of frequency (in other words,
not more than 20x as frequent in one of the two corpora). In this
case, there are 142 words (yellow) that are more frequent in COCA,
and 97 words (blue) that are more frequent in the BNC.)
So if we use the criterion of word frequency,
but go beyond the one word (intestine) that Egbert,
Larsson, and Biber use and look systematically at all words in the
AWL, the relative frequency of academic words is higher in
COCA-Academic than in BNC-Academic.
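The filtering described above (at least 50% more frequent in one corpus, but no more than 20x) can be sketched roughly as follows. This is only an illustration of the criteria, not the script used to build the spreadsheet: the `WordFreq` record, the `classify` function, and the sample rows are all hypothetical, and the handling of words with a zero frequency in one corpus is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WordFreq:
    word: str
    coca_pmw: float  # normalized frequency in COCA (per million words)
    bnc_pmw: float   # normalized frequency in the BNC (per million words)


def classify(w: WordFreq, min_ratio: float = 1.5, max_ratio: float = 20.0) -> str:
    """Return 'COCA', 'BNC', 'similar', or 'excluded' for one word."""
    high = max(w.coca_pmw, w.bnc_pmw)
    low = min(w.coca_pmw, w.bnc_pmw)
    if low <= 0:
        # Ratio is undefined; assumption: drop words missing from one corpus.
        return "excluded"
    ratio = high / low
    if ratio < min_ratio:
        return "similar"   # less than 50% apart
    if ratio > max_ratio:
        return "excluded"  # more than 20x apart: not "in the ballpark"
    return "COCA" if w.coca_pmw > w.bnc_pmw else "BNC"


# Hypothetical rows for illustration only; not taken from the spreadsheet.
sample = [
    WordFreq("word_a", 12.0, 7.0),   # the 12 vs 7 pmw example from the text
    WordFreq("word_b", 7.0, 12.0),
    WordFreq("word_c", 10.0, 9.5),
    WordFreq("word_d", 100.0, 2.0),
]

counts = {}
for row in sample:
    label = classify(row)
    counts[label] = counts.get(label, 0) + 1

print(counts)
```

Run over the full AWL spreadsheet instead of the sample rows, the "COCA" and "BNC" tallies would correspond to the yellow and blue counts reported above.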
Egbert, Biber, Gray (2022)
I provided detailed data to Jesse
Egbert and Douglas Biber to show that in all three cases in ELB
(2020) the truth is the opposite of what they claim. They
emailed back to say that they had received this information. However, the authors
chose not to include any of these three supposed pieces of
“evidence” in their 2022 book. Apparently it was okay to include
these tests when they supposedly favored the BNC, but they needed to
be removed once it was shown that they favored COCA.
Rather than re-use any of these three
tests in the 2022 book, the authors found two new tests that
(they claimed) showed that BNC academic is “better” academic than
COCA academic.
4. The first is "noun complements". In noun
complements (for example, the claim [ that this research
is valid ]), the head noun (e.g. claim) does not play a
role in the subordinate clause. Compare this to a relative clause like
the man [ that you saw __ ], where
the head noun does play a role in the subordinate clause (as object
in this case).
Our response: Unfortunately, the authors never provide a list of
the head nouns that are involved in their BNC/COCA study of noun
complements, and so it is impossible to confirm their claims
regarding that construction.
5. The authors also look at the
NOUN+NOUN construction (e.g. data collection, case study),
which is a feature of academic English. They assert that this
construction is more common in the BNC (pages 211-215).
Our response: Actual data
from COCA and the BNC clearly disproves the claim made in the book. The following is the frequency of NOUN+NOUN in the
academic portion of both COCA and the BNC. (Click on the links for a
list of NOUN+NOUN in the two corpora. These lists show all
NOUN+NOUN strings that occur five times or more in the 120 million
word academic portion of COCA, and two times or more in the
15 million word academic portion of the BNC.)
| Corpus | # words in academic | # NOUN + NOUN | NOUN + NOUN per million words |
|---|---|---|---|
| COCA | 119,790,456 | 4,225,693 | 35,276 |
| BNC | 15,331,668 | 360,362 | 23,504 |
As can be seen, NOUN+NOUN is more than
50% more common (per million words) in COCA than in the BNC.
Following their criteria, COCA has “better” academic.
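As a quick check on the normalization, the per-million rates can be recomputed from the raw counts in the table. This is a minimal sketch; the `per_million` helper is a hypothetical name, and only the four counts come from the table.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw token count to a rate per million words."""
    return count / corpus_size * 1_000_000


# Raw counts from the table: NOUN+NOUN tokens and academic corpus size.
coca_rate = per_million(4_225_693, 119_790_456)
bnc_rate = per_million(360_362, 15_331_668)

# Ratio of the two normalized rates (COCA relative to the BNC).
ratio = coca_rate / bnc_rate

print(f"COCA: {coca_rate:,.0f} per million words")
print(f"BNC:  {bnc_rate:,.0f} per million words")
print(f"COCA / BNC ratio: {ratio:.2f}")
```

The ratio comes out at roughly 1.5, which is the basis for the "50% more common" comparison.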
5a. Related data also shows that
the lexis (words) in BNC academic seems to be quite “skewed”
towards medical texts. For example, each of the following phrases is
in the
top 35 NOUN+NOUN sequences in the BNC: gall bladder, bowel
disease, biopsy specimens, acid secretion, bile acid, and
plasminogen activator. The academic genre should have texts from
a wide range of domains – like science, law, philosophy, humanities,
history, education, etc. (COCA is balanced between these
domains.) Why the serious skewing towards the medical domain in the
BNC? EBG (2022) focuses almost entirely on syntactic issues rather
than lexical issues, so perhaps these seriously skewed texts escaped
their notice.
And just to summarize, the following
are the results of the five tests that the authors mention in the
2020 and 2022 books, in their comparison of BNC and COCA academic:
| # | Book | Phenomenon | BNC vs COCA: which is more “academic” |
|---|---|---|---|
| 1 | 2020 | Linking adverbials | They say BNC, but it is COCA when adjusted for time/dialect (#1) |
| 2 | 2020 | Nominalizations | They say BNC, but it is COCA when adjusted for time/dialect (#2) |
| 3, 5a | 2020/2022 | Lexical (including NOUN + NOUN) | Clearly COCA |
| 4 | 2022 | Noun complements | Possibly BNC (but no clear explanation of data in book) |
| 5 | 2022 | NOUN + NOUN | Clearly COCA |
Based on the phenomena that the authors
themselves provide in their books, COCA academic is probably
more representative of what we would expect from academic
than is the academic sub-corpus in the BNC.