English-Corpora.org


Note: this page makes several references to Egbert, Larsson, Biber (2020) and Egbert, Biber, Gray (2022) (more). For convenience, we will refer to these as ELB 2020 and EBG 2022.


1. What is representativity? "Representativity" refers to how well the texts in a corpus represent what is "going on in the real world". For example, if a word or phrase or grammatical construction is frequent in the "real world" but isn't frequent in a corpus, that would be a problem. Conversely, if the corpus indicates that the word or phrase or grammatical construction is frequent, but it is not frequent in the real world, that would be a problem as well.

Representativity can refer to a number of different features -- such as time period (for example, whether the texts from the 1920s in a historical corpus represent well what people would have seen and heard in that decade) or dialect (for example, whether the texts from Ireland in a corpus of international English represent well what one would hear and see if s/he went to Ireland). But for some reason, most researchers who look at representativity look primarily (or exclusively) at genres. For example, EBG 2022 deals almost exclusively with just genres.

2. Representing genres. Obviously, some linguistic features will be found more in very informal language (such as TV or movie subtitles, or conversation), while others might be found more in formal genres (such as academic texts). For example, click on any of the following to see the frequency in the different genres in COCA (the Corpus of Contemporary American English) and the BNC (British National Corpus):

Note: click on any link on this page to see the corpus data, and then click on the "BACK" image (see left) at the top of the page to come back to this page.
                           spoken (incl. COCA TV/Movies)            fiction                   newspaper                     academic
words                      thanks  COCA BNC                         frowned  C B              championship  C B             variance  C B
phrases                    I think  C B                             BE startled  C B          winning NOUN  C B             in particular ,  C B
grammatical constructions  CONJ PRON BE like , (and I'm like)  C B  had _vvn (had seen)  C B  has _vvn (has resulted)  C B  BE _vvn (was discovered)  C B

One goal of a "balanced" corpus would be to allow users to see the frequency of a feature across the different genres. For example, if the corpus is composed solely of texts from newspapers or web pages (which are very easy to get), but it doesn't have any texts from fiction, then words like (NOUN) eyes, stairs, smile (ADJ) pale, faint, dark (VERB) stare, fade, lean (ADV) softly, gently will be very infrequent in the corpus. But most native speakers of English wouldn't think of eyes or dark or softly or lean (as a verb) as being particularly strange, which shows how skewed the data from a corpus that is based solely on newspapers or web texts might be.

3. Comparing genres in COCA and the BNC. As can be seen in the following tables, both COCA and the BNC have a number of genres, from informal (spoken, as well as TV/Movie subtitles in COCA) to formal (academic).

COCA
Genre            # texts          # words
TV/Movies         23,975      129,293,467
Spoken            44,803      127,396,932
Fiction           25,992      119,505,305
Magazine          86,292      127,352,030
Newspaper         90,243      122,958,016
Academic          26,137      120,988,361
Web-Blog          98,748      125,496,216
Web-General       88,989      129,899,427
TOTAL            485,179    1,002,889,754

BNC (1994) (notes on BNC 2014)
Genre            # texts          # words
Spoken               909       10,334,947
Non-Academic         501       15,429,582
Fiction              464       16,194,885
Magazine             211        7,376,391
Newspaper            518       10,638,034
Academic             534       16,634,076
Miscellaneous        916       21,011,396
TOTAL              4,053       97,619,311
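
Because the two corpora differ in size by roughly a factor of ten, genre balance is easier to compare as percentages, and raw hit counts are easier to compare as rates per million words. The following is a minimal Python sketch of both calculations. It uses only the word counts from the two tables above; the helper names genre_shares and per_million are illustrative assumptions and are not part of the corpus interface.

```python
# Word counts copied from the COCA and BNC (1994) tables above.
COCA_WORDS = {
    "TV/Movies": 129_293_467,
    "Spoken": 127_396_932,
    "Fiction": 119_505_305,
    "Magazine": 127_352_030,
    "Newspaper": 122_958_016,
    "Academic": 120_988_361,
    "Web-Blog": 125_496_216,
    "Web-General": 129_899_427,
}

BNC_WORDS = {
    "Spoken": 10_334_947,
    "Non-Academic": 15_429_582,
    "Fiction": 16_194_885,
    "Magazine": 7_376_391,
    "Newspaper": 10_638_034,
    "Academic": 16_634_076,
    "Miscellaneous": 21_011_396,
}


def genre_shares(word_counts):
    """Return each genre's share of the corpus as a percentage (hypothetical helper)."""
    total = sum(word_counts.values())
    return {genre: 100 * n / total for genre, n in word_counts.items()}


def per_million(raw_count, section_words):
    """Normalize a raw hit count to a rate per million words (hypothetical helper)."""
    return raw_count * 1_000_000 / section_words


# Print the genre breakdown of each corpus as percentages.
for name, counts in (("COCA", COCA_WORDS), ("BNC (1994)", BNC_WORDS)):
    print(name)
    for genre, share in genre_shares(counts).items():
        print(f"  {genre:<14}{share:5.1f}%")
```

To compare a word across corpora, one would pass the raw frequency reported by the corpus for a given genre, together with that genre's word count from the table, to per_million; no frequencies are hard-coded here because they should come from the corpus itself.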

Some genres are not overly controversial, such as fiction, magazines, and newspapers. But three genres (or groups of genres) merit more discussion, in terms of a comparison between COCA and the BNC:

Academic

ELB 2020 and EBG 2022 (see above) have claimed that COCA academic is poorly constructed, and that it doesn't represent well what we know about academic from other corpora like the BNC and the Longman Corpus. But actual data from the corpora show that this is wrong. If anything, the texts in COCA academic are more representative of what we know about academic than those of the BNC.

Spoken (and TV/Movies)

The texts in BNC spoken do a great job representing spoken language. About half of the 10+ million words of data in BNC spoken is from everyday conversation, and the other half is "context dependent" (church sermons, courtroom arguments, classroom discussion, etc). In COCA, the spoken texts come from transcripts of unscripted conversation on national TV and radio programs. However, COCA also has almost 130 million words from TV and Movies subtitles. The language in these texts is very informal, and it models everyday conversation extremely well.

Web (Blogs and General)

Nowadays, most people spend a good portion of their day reading material from the Web. COCA has more than 250 million words of data from the Web, and the texts have been categorized very well into different web genres by Serge Sharoff. But because virtually all of the texts in the BNC (1994) were collected before the Web was widely used, it doesn't have any texts from this important genre.

In the 2014 update (which actually came out in 2021 -- 27 years after the original 1994 BNC), the BNC finally had some texts from the web -- a little less than 5 million words (compared to 250 million words in COCA). So this is a very important difference between COCA and the BNC in terms of how well the amount of web texts in the corpus represents what we encounter in the "real world".

4. What about other genres?

EBG 2022 (page 263) criticizes COCA because it "falls short of its claims to represent the entirety of “contemporary American English”". There are at least three responses to this criticism:

  • They say that I make this claim -- that COCA represents "the entirety" of English -- on page 176 of Davies 2009. This is false. Neither on page 176 of that article, nor anywhere else in the article (check for this by searching that article for the substring represent*), nor anywhere in my 80+ publications, have I made this claim.

  • And of course, COCA does not "represent the entirety of English" -- but nor does any other corpus. For example, COCA doesn't have any texts at all of people interacting with chiropractors, or menus from Korean restaurants, or freeway billboards, or warning labels from lawn care products, or EULAs for operating systems, or dishwasher repair manuals – all of which would presumably be included in a corpus of American English "in its entirety". But the BNC doesn't have these either, nor does any other corpus.

  • As their two books were being written, the BNC still didn't have a single text from the Web, but ELB 2020 and EBG 2022 fail to mention this crucial fact. (Note that BNC 2014 Written, which was released in late 2021, does have texts from the Web -- about 5 million words, compared to 250 million words in COCA).

EBG 2022 (page 36) say that I have made the claim that "the notion of representativeness does not really matter, because corpus size is the primary consideration" (their exact wording).

  • Again, they are making up quotes. Search all of my articles -- I have never made this claim, or anything like it. This may be why they don't actually provide a link to any of my articles for this quote.

Finally, EBG 2022 (page 37) say that I am "untroubled by reservations about issues of balance and representativeness" (their exact wording).

  • Once again, this is false. I have written many papers talking about corpus "balance" -- that a good general corpus should have a number of genres from informal (e.g. spoken) to formal (e.g. academic). But I have strictly avoided using the "r word" (representativity), because it is such a "charged" concept.

5. Comparing genres in COCA and the Longman Corpus

ELB 2020 and EBG 2022 talk very positively about the Longman Corpus of English, which served as the basis of Biber et al (1999). And I would concur. In fact, I liked this grammar so much that I used the "student" version of the grammar more than twenty times, as I taught English Grammar at Brigham Young University from 2003 to 2020 (including a section of the class that Jesse Egbert took as a student).

But if ELB 2020 and EBG 2022 are so critical of COCA -- for "only" including the eight genres listed above -- what about Biber's Longman Corpus? It only has four genres -- spoken, fiction, newspaper, and academic. And yet not once in either of these two books is there ever any criticism of the Longman Corpus in terms of the fact that it doesn't "represent the entirety of ... English".


6. "The proof of the pudding is in the eating"

ELB 2020 and EBG 2022 repeatedly make the claim that COCA has a number of serious problems, with a lack of representativity being one of the main issues. But how do we square that with the pre-eminent status of COCA in terms of corpus use?

  • The corpora from English-Corpora.org are the most widely used corpora in the world, and COCA is by far the most widely used of the 17 corpora at the site. Hundreds of thousands of researchers, teachers, and students have found the data from COCA to be more reliable and useful than that of any other corpus.

  • Data from COCA has been used for thousands of academic studies. Why would researchers use the corpus so much, if it did such a bad job providing data on English?

  • Hundreds of companies have used data from COCA for their products. This includes large technology companies like Amazon, Facebook, Google, Microsoft, IBM, Sony, Disney, Intel, Adobe, Samsung; and language-related companies like Merriam-Webster, Dictionary.com, Grammarly, Duolingo, TurnItIn, Oxford University Press, and Sketch Engine. If COCA really did as bad a job at providing data on English as ELB 2020 and EBG 2022 claim, why would these companies use data from COCA so extensively?

  • Rather than relying on biased critiques of COCA, just use the corpus yourself. Look for any word, or phrase, or grammatical construction, and see whether the corpus data agrees with your intuitions. You be the judge -- not some "ivory tower experts". We believe that you will find the same thing that hundreds of thousands of other users have found -- that COCA provides unparalleled data on variation in English.