Note: this page makes several references to
Egbert, Larsson, Biber (2020) and
Egbert, Biber, Gray (2022). For
convenience, we will refer to these as ELB 2020 and EBG 2022.
1. What is representativity? "Representativity" refers to how
well the texts in a corpus represent what is "going on in the
world". For example, if a word or phrase or grammatical construction
is frequent in the "real world" but isn't frequent in a corpus, that
would be a problem. Conversely, if the corpus indicates that the
word or phrase or grammatical construction is frequent, but it is
not frequent in the real world, that would be a problem as well.
Representativity can refer to a number
of different features -- such as time period (for example, whether
the texts from the 1920s in a historical corpus represent well what
people would have seen and heard in that decade) or dialect (for
example, whether the texts from Ireland in a corpus of international
English represent well what one would hear and see if s/he went to
Ireland). But for some reason, most researchers who look at
representativity look primarily (or exclusively) at genres. For
example, EBG 2022 deals almost exclusively with genres.
2. Representing genres.
Obviously, some linguistic features will be found more in very
informal language (such as TV or movie subtitles, or conversation),
while others might be found more in formal genres (such as academic
texts). For example, click on any of the following to see the
frequency in the different genres in
COCA (the Corpus of
Contemporary American English) and the
BNC (British National Corpus):
||spoken (incl. COCA
||in particular ,
||CONJ PRON BE like , (and I'm like)
||had _vvn (had seen)
||has _vvn (has resulted)
||BE _vvn (was discovered)
One goal of a "balanced" corpus would
be to allow users to see the frequency of a feature across the
different genres. For example, if the corpus is composed solely of
texts from newspapers or web pages (which are very easy to get), but
it doesn't have any texts from fiction, then words like eyes (NOUN), dark, softly, lean (VERB), or
gently will be very infrequent in the corpus.
But most native speakers of English wouldn't think of eyes
or dark or softly or lean (as a verb) as being
particularly strange, which shows how skewed the data from a
corpus that is based solely on newspapers or web texts might be.
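To make the point about skew concrete, here is a minimal sketch of how a per-genre comparison is usually computed: raw hits are normalized to a rate per million words, so that genres of different sizes can be compared. The genre sizes and hit counts below are invented for illustration only; they are not taken from COCA or the BNC, and the function names are hypothetical.

```python
def per_million(hits: int, size_in_words: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    if size_in_words <= 0:
        raise ValueError("size_in_words must be positive")
    return hits * 1_000_000 / size_in_words


def genre_profile(hits_by_genre: dict, size_by_genre: dict) -> dict:
    """Per-million rate of a feature in each genre of a corpus."""
    return {
        genre: per_million(hits_by_genre.get(genre, 0), size)
        for genre, size in size_by_genre.items()
    }


def pooled_rate(hits_by_genre: dict, size_by_genre: dict, genres: list) -> float:
    """Per-million rate in a corpus built only from the listed genres."""
    total_hits = sum(hits_by_genre.get(g, 0) for g in genres)
    total_size = sum(size_by_genre[g] for g in genres)
    return per_million(total_hits, total_size)


# Hypothetical numbers (assumption): a word that is typical of fiction.
sizes = {"fiction": 1_000_000, "newspaper": 2_000_000, "web": 4_000_000}
hits = {"fiction": 120, "newspaper": 30, "web": 40}

profile = genre_profile(hits, sizes)
balanced = pooled_rate(hits, sizes, ["fiction", "newspaper", "web"])
no_fiction = pooled_rate(hits, sizes, ["newspaper", "web"])

for genre, rate in profile.items():
    print(f"{genre:10s} {rate:8.1f} per million")
print(f"with fiction:    {balanced:.1f} per million")
print(f"without fiction: {no_fiction:.1f} per million")
```

With these made-up counts, leaving out fiction cuts the overall rate by more than half, which is the kind of distortion described above for a corpus built only from newspapers or web pages.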
3. Comparing genres in COCA and the
BNC. As can be seen in the following tables, both COCA and the
BNC have a number of genres, from informal (spoken, as well as
TV/Movie subtitles in COCA) to formal (academic).
(notes on BNC 2014)
In this document we discuss the BNC
1994 release, not the BNC 2014 release. The written portion of BNC
2014 was supposedly
released in late
2021, but as of July 2023 it is
still not really publicly available. It is only available via the
proprietary LancsBox X software, but not via the Web or as
downloadable files. Apparently there is still (July 2023) no
publicly-accessible information on sub-genres (size, etc), and
especially no publicly-accessible metadata for the 88,000+ texts in
the written portion of the corpus. As EBG 2022 notes, one
of the first things that should be available for a corpus is
information about what is in the corpus, including metadata about
the texts (such as we have
made available for COCA since the day it
was released, and for every update since then). Until something
similar is available for BNC 2014, we will limit our discussion to
the portion of the BNC that is truly publicly available – the 1994
data. If you are aware of a change in status for the BNC 2014
Written data, please email us and we will change this document.
Some genres are not overly
controversial, such as fiction, magazines, and newspapers. But three
genres (or groups of genres) merit more discussion, in terms of a
comparison between COCA and the BNC:
Academic
ELB 2020 and EBG 2022 (see
above) have claimed that COCA academic
is poorly constructed, and that it doesn't represent well what
we know about academic texts from other corpora like the BNC and the
Longman Corpus. But actual data from the corpora show that this
is completely wrong. If anything, the texts in COCA academic are more representative
of what we know about academic texts than those of the BNC.
Spoken
The texts in BNC spoken do a great job representing spoken language.
About half of the 10+ million words of data in BNC spoken is
from everyday conversation, and the other half is "context
dependent" (church sermons, courtroom arguments, classroom
discussion, etc). In COCA, the spoken texts
come from transcripts of unscripted conversation on national
TV and radio programs.
However, COCA also
has almost 130 million words from
TV and movie subtitles. The
language in these texts is very informal, and it represents
conversation extremely well.
Web (Blogs and General)
Nowadays, most people spend a good portion of their day reading
material from the Web. COCA has more than 250 million words of
data from the Web, and the texts have been
categorized very well into different
web genres by
Serge Sharoff. But because virtually all of the texts in the
BNC (1994) were collected before the Web was widely used, it
doesn't have any texts from this important genre. There are
texts from the Web in BNC 2014, but that corpus is not really
publicly available. So this is a
huge difference between COCA
and the BNC in terms of how well they represent the "real world".
4. What about other genres?
EBG 2022 (page
263) criticizes COCA
because it "falls short of its claims to represent the entirety
of “contemporary American English”". There are at least three responses
to this criticism:
They say that I make this
claim -- that COCA represents "the entirety" of English -- on
page 176 of
Davies 2009. This is false. Neither on
page 176 of that article, nor anywhere else in
my 80+ publications, have I ever made this claim
(you can check this by searching that article).
And of course, COCA does not
"represent the entirety of English" -- but nor does
any other corpus. For example, COCA doesn't have any
texts at all of people interacting with chiropractors, or menus
from Korean restaurants, or freeway billboards, or warning
labels from lawn care products, or EULA agreements for operating
systems, or dishwasher repair manuals – all of which would
presumably be included in a corpus of American English "in its
entirety". But guess what? The BNC doesn't have these either,
nor does any other corpus. Why the animus against COCA in particular?
As their two books were being written,
the BNC still didn't have a single
text from the Web (including blogs). But ELB 2020 and EBG 2022 fail to mention this crucial fact entirely. Why is this?
EBG 2022 (page 36) say that I have
made the claim that "the notion of representativeness does not
really matter, because corpus size is the primary consideration"
(their exact wording).
Again, they are making up quotes.
Search all of my articles -- I have never made this claim, or
anything like it. And that's why they don't actually provide a
link to any of my articles for this quote.
Finally, EBG 2022 (page 37) say
that I am "untroubled by reservations about issues of balance and representativeness" (their exact wording).
Once again, this is completely
false. I have written
many papers talking about corpus "balance"
-- that a good general corpus should have a number of genres
from informal (e.g. spoken) to formal (e.g. academic). But I
have strictly avoided using the "r word" (representativity),
because it is such a "charged" concept.
5. Comparing genres in COCA and the Longman Corpus.
ELB 2020 and EBG 2022 are not shy about singing the
praises of the Longman Corpus of English, which served as the basis for
Biber et al (1999). And I would agree. In fact,
I liked this
grammar so much that I used the
"student" version of the grammar
more than twenty times, as I taught English Grammar at Brigham Young
University from 2003
to 2020 (including a section of the class that Jesse Egbert took as
But if ELB 2020 and EBG 2022 are so critical of COCA -- for "only" including the
eight genres listed above -- what about Biber's Longman Corpus? It
only has four genres -- spoken, fiction, newspaper, and
academic. And yet not once in either of these two books is there
ever any criticism of the Longman Corpus in terms of the fact that
it doesn't "represent the entirety of ... English".
6. "The proof of the pudding is in the eating".
ELB 2020 and EBG 2022 repeatedly make
the claim that COCA has a number of serious problems, with a lack of
representativity being one of the main issues. But how do we square
that with the pre-eminent status of COCA in terms of corpus use?
The corpora from English-Corpora.org
are the most widely used corpora in the world, and COCA is by
far the most widely used of the 17 corpora at the site. Hundreds
of thousands of researchers, teachers, and students have found
the data from COCA to be more reliable and useful than that of
any other corpus.
Data from COCA has been used for
thousands of academic studies. Why would researchers use the
corpus so much, if it did such a bad job providing data on English?
Hundreds of companies have used
data from COCA for their products. This includes large
technology companies like Amazon, Facebook, Google, Microsoft,
IBM, Sony, Disney, Intel, Adobe, Samsung; and language-related
companies like Merriam-Webster, Dictionary.com, Grammarly,
Duolingo, TurnItIn, Oxford University Press, and Sketch Engine.
If COCA really did as bad a job at providing data on English
as ELB 2020 and EBG 2022 claim, why would
these companies use data from COCA so extensively?
Rather than relying on biased
critiques of COCA, just use the corpus yourself. Look for
any word, or phrase, or grammatical construction, and see
whether the corpus data agrees with your intuitions. You
be the judge -- not some "ivory tower"
"experts". We believe that you will find the same thing that
hundreds of thousands of other users have found -- that COCA
provides unparalleled data on variation in English.