Note: this page makes several references to Egbert, Larsson, and Biber (2020) and Egbert, Biber, and Gray (2022). For convenience, we will refer to these as ELB 2020 and EBG 2022.
1. What is representativity? "Representativity" refers to how
well the texts in a corpus represent what is "going on in the
real
world". For example, if a word or phrase or grammatical construction
is frequent in the "real world" but isn't frequent in a corpus, that
would be a problem. Conversely, if the corpus indicates that the
word or phrase or grammatical construction is frequent, but it is
not frequent in the real world, that would be a problem as well.
Representativity can refer to a number
of different features -- such as time period (for example, whether
the texts from the 1920s in a historical corpus represent well what
people would have seen and heard in that decade) or dialect (for
example, whether the texts from Ireland in a corpus of international
English represent well what one would hear and see if s/he went to
Ireland). But for some reason, most researchers who look at
representativity look primarily (or exclusively) at genres. For
example, EBG 2022 deals almost exclusively with genres.
2. Representing genres.
Obviously, some linguistic features will be found more in very
informal language (such as TV or movie subtitles, or conversation),
while others might be found more in formal genres (such as academic
texts). For example, click on any of the following to see the
frequency in the different genres in
COCA (the Corpus of
Contemporary American English) and the
BNC (British National
Corpus):
| | spoken (incl. COCA TV/Movies) | fiction | newspaper | academic |
| words | thanks | frowned | championship | variance |
| phrases | I think | BE startled | winning NOUN | in particular , |
| grammatical constructions | CONJ PRON BE like , (and I'm like) | had _vvn (had seen) | has _vvn (has resulted) | BE _vvn (was discovered) |
One goal of a "balanced" corpus would
be to allow users to see the frequency of a feature across the
different genres. For example, if the corpus is composed solely of
texts from newspapers or web pages (which are very easy to get), but
it doesn't have any texts from fiction, then words like eyes, stairs, smile (NOUN); pale, faint, dark (ADJ); stare, fade, lean (VERB); and softly, gently (ADV) will be very infrequent in the corpus.
But most native speakers of English wouldn't think of eyes
or dark or softly or lean (as a verb) as being
particularly strange, which shows how skewed the data from a
corpus that is based solely on newspapers or web texts might be.
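Because genres differ in size, a feature is normally compared across genres as a rate per million words rather than as a raw count. The following Python sketch illustrates that normalization. The two genre sizes are the COCA figures from the table in section 3; the raw counts and the function name `per_million` are hypothetical, invented only for illustration and not taken from any corpus.

```python
# Genre sizes (in words) from the COCA table in section 3.
COCA_GENRE_WORDS = {
    "Fiction": 119_505_305,
    "Newspaper": 122_958_016,
}


def per_million(raw_count: int, genre_words: int) -> float:
    """Return occurrences per million words for one genre."""
    if genre_words <= 0:
        raise ValueError("genre_words must be positive")
    return raw_count * 1_000_000 / genre_words


# HYPOTHETICAL raw counts for a single word, made up for this example.
hypothetical_counts = {"Fiction": 12_000, "Newspaper": 600}

normalized = {
    genre: per_million(count, COCA_GENRE_WORDS[genre])
    for genre, count in hypothetical_counts.items()
}

for genre, rate in normalized.items():
    print(f"{genre}: {rate:.2f} per million words")
```

With rates like these, a word can be compared across genres (or across corpora of very different sizes, such as COCA and the BNC) on the same scale.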
3. Comparing genres in COCA and the
BNC. As can be seen in the following tables, both COCA and the
BNC have a number of genres, from informal (spoken, as well as
TV/Movie subtitles in COCA) to formal (academic).
COCA
| Genre | # texts | # words |
| TV/Movies | 23,975 | 129,293,467 |
| Spoken | 44,803 | 127,396,932 |
| Fiction | 25,992 | 119,505,305 |
| Magazine | 86,292 | 127,352,030 |
| Newspaper | 90,243 | 122,958,016 |
| Academic | 26,137 | 120,988,361 |
| Web-Blog | 98,748 | 125,496,216 |
| Web-General | 88,989 | 129,899,427 |
| TOTAL | 485,179 | 1,002,889,754 |
BNC (1994) (notes on BNC 2014)
| Genre | # texts | # words |
| Spoken | 909 | 10,334,947 |
| Non-Academic | 501 | 15,429,582 |
| Fiction | 464 | 16,194,885 |
| Magazine | 211 | 7,376,391 |
| Newspaper | 518 | 10,638,034 |
| Academic | 534 | 16,634,076 |
| Miscellaneous | 916 | 21,011,396 |
| TOTAL | 4,053 | 97,619,311 |
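As a rough check on how evenly the words are spread across genres, each genre's share of the total can be computed directly from the word counts in the two tables above. This is a minimal Python sketch that uses only those counts; the helper name `genre_shares` is our own and is not part of any corpus tool.

```python
# Word counts per genre, copied from the COCA and BNC (1994) tables above.
COCA = {
    "TV/Movies": 129_293_467,
    "Spoken": 127_396_932,
    "Fiction": 119_505_305,
    "Magazine": 127_352_030,
    "Newspaper": 122_958_016,
    "Academic": 120_988_361,
    "Web-Blog": 125_496_216,
    "Web-General": 129_899_427,
}
BNC_1994 = {
    "Spoken": 10_334_947,
    "Non-Academic": 15_429_582,
    "Fiction": 16_194_885,
    "Magazine": 7_376_391,
    "Newspaper": 10_638_034,
    "Academic": 16_634_076,
    "Miscellaneous": 21_011_396,
}


def genre_shares(words_by_genre: dict[str, int]) -> dict[str, float]:
    """Return each genre's percentage of the corpus total."""
    total = sum(words_by_genre.values())
    return {genre: 100 * words / total for genre, words in words_by_genre.items()}


for name, corpus in (("COCA", COCA), ("BNC (1994)", BNC_1994)):
    print(f"{name}: {sum(corpus.values()):,} words")
    for genre, share in genre_shares(corpus).items():
        print(f"  {genre:<14}{share:5.1f}%")
```

The sums of the per-genre counts match the TOTAL rows in both tables, so the shares are computed over the same totals the tables report.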
In this document we discuss the BNC
1994 release, not the BNC 2014 release. The written portion of BNC
2014 (which was actually released in late 2021) is available only via the
proprietary LancsBox X software (which you must install on your own
computer), not via the Web or as
downloadable files. While LancsBox X has been improved recently, it
is still much slower than COCA (more than 100 times as slow for some
types of searches), and it still provides only limited searches (for
example, no ability to search by more than the eight "macro genres"
and no Virtual Corpora).
In addition, for the vast majority of
the texts, the [sources] spreadsheet does not include enough
information to identify the original texts (such as a title or a
URL). As EBG 2022 notes, one
of the first things that should be available for a corpus is
information about what is in the corpus, including metadata about
the texts (such as we have
made available for COCA since the day it
was released, and for every update since then). Until BNC 2014
Written is publicly available (via the Web or via downloadable files
that you can analyze with any piece of software) and until it has
enough metadata about the texts to allow us to identify their
source, we will limit our discussion to
BNC 1994. If you are aware of a change in status for the BNC 2014
Written data, please email us and we will change this document
accordingly. (Updated Jan 2024)
Some genres are not overly
controversial, such as fiction, magazines, and newspapers. But three
genres (or groups of genres) merit more discussion, in terms of a
comparison between COCA and the BNC:
| Academic | ELB 2020 and EBG 2022 (see above) have claimed that COCA academic is poorly constructed, and that it doesn't represent well what we know about academic writing from other corpora like the BNC and the Longman Corpus. But actual data from the corpora show that this is not true at all. If anything, it is the academic texts in COCA that are more representative of what we know about academic writing than the academic texts in the BNC. |
| Spoken (and TV/Movies) | The texts in BNC spoken do a great job representing spoken language. About half of the 10+ million words of data in BNC spoken is from everyday conversation, and the other half is "context dependent" (church sermons, courtroom arguments, classroom discussion, etc.). In COCA, the spoken texts come from transcripts of unscripted conversation on national TV and radio programs. However, COCA also has almost 130 million words from TV and movie subtitles. The language in these texts is very informal, and it models everyday conversation extremely well. |
| Web (Blogs and General) | Nowadays, most people spend a good portion of their day reading material from the Web. COCA has more than 250 million words of data from the Web, and the texts have been categorized very well into different web genres by Serge Sharoff. But because virtually all of the texts in the BNC (1994) were collected before the Web was widely used, it doesn't have any texts from this important genre. In the 2014 update (which actually came out in 2021 -- 27 years after the original 1994 BNC), the BNC finally had some texts from the web -- a little less than 5 million words (compared to 250 million words in COCA). So this is a very important difference between COCA and the BNC in terms of how well the amount of web texts in the corpus represents what we encounter in the "real world". |
4. What about other genres?
EBG 2022 (page 263) criticizes COCA
because it "falls short of its claims to represent the entirety
of 'contemporary American English'". There are at least three responses
to this criticism:
- They say that I make this claim -- that COCA represents "the entirety" of English -- on page 176 of Davies 2009. This is false. Neither on page 176 of that article, nor anywhere else in the article (check for this by searching that article for the substring represent*), nor anywhere in my 80+ publications, have I made this claim.
- And of course, COCA does not "represent the entirety of English" -- but neither does any other corpus. For example, COCA doesn't have any texts at all of people interacting with chiropractors, or menus from Korean restaurants, or freeway billboards, or warning labels from lawn care products, or EULA agreements for operating systems, or dishwasher repair manuals -- all of which would presumably be included in a corpus of American English "in its entirety". But the BNC doesn't have these either, nor does any other corpus.
- As their two books were being written, the BNC still didn't have a single text from the Web, but ELB 2020 and EBG 2022 fail to mention this crucial fact. (Note that BNC 2014 Written, which was released in late 2021, does have texts from the Web -- about 5 million words, compared to 250 million words in COCA.)
EBG 2022 (page 36) say that I have made the claim that "the notion of representativeness does not really matter, because corpus size is the primary consideration" (their exact wording).
- Again, they are making up quotes. Search all of my articles -- I have never made this claim, or anything like it. This may be why they don't actually provide a link to any of my articles for this quote.
Finally, EBG 2022 (page 37) say that I am "untroubled by reservations about issues of balance and representativeness" (their exact wording).
- Once again, this is false. I have written many papers talking about corpus "balance" -- that a good general corpus should have a number of genres from informal (e.g. spoken) to formal (e.g. academic). But I have strictly avoided using the "r word" (representativity), because it is such a "charged" concept.
5. Comparing genres in COCA and the
Longman Corpus
ELB 2020 and EBG 2022 talk very
positively about the Longman Corpus of English, which served as the basis
of
Biber et al. (1999). And I would
concur. In fact,
I liked this
grammar so much that I used the
"student" version of the grammar
more than twenty times, as I taught English Grammar at Brigham Young
University from 2003
to 2020 (including a section of the class that Jesse Egbert took as
a student).
But if ELB 2020 and EBG 2022 are so critical of COCA -- for "only" including the
eight genres listed above -- what about Biber's Longman Corpus? It
only has four genres -- spoken, fiction, newspaper, and
academic. And yet not once in either of these two books is there
ever any criticism of the Longman Corpus in terms of the fact that
it doesn't "represent the entirety of ... English".
6. "The proof of the pudding is in
the eating"
ELB 2020 and EBG 2022 repeatedly make
the claim that COCA has a number of serious problems, with a lack of
representativity being one of the main issues. But how do we square
that with the pre-eminent status of COCA in terms of corpus use?
- The corpora from English-Corpora.org are the most widely used corpora in the world, and COCA is by far the most widely used of the 17 corpora at the site. Hundreds of thousands of researchers, teachers, and students have found the data from COCA to be more reliable and useful than that of any other corpus.
- Data from COCA has been used for thousands of academic studies. Why would researchers use the corpus so much, if it did such a bad job in providing data on English?
- Hundreds of companies have used data from COCA for their products. This includes large technology companies like Amazon, Facebook, Google, Microsoft, IBM, Sony, Disney, Intel, Adobe, Samsung; and language-related companies like Merriam-Webster, Dictionary.com, Grammarly, Duolingo, TurnItIn, Oxford University Press, and Sketch Engine. If COCA really did as bad a job at providing data on English as ELB 2020 and EBG 2022 claim, why would these companies use data from COCA so extensively?
- Rather than relying on biased critiques of COCA, just use the corpus yourself. Look for any word, or phrase, or grammatical construction, and see whether the corpus data agrees with your intuitions. You be the judge -- not some "ivory tower experts" with their own agenda (i.e. attempting to make COCA look bad before they apply for funding to create their own corpus of American English). We believe that you will find the same thing that hundreds of thousands of other users have found -- that COCA provides unparalleled data on variation in English.