English Corpora: most widely used online corpora. Billions of words of data: free online access

COCA / BNC: Web texts

The recent books Doing Linguistics with a Corpus (2020) by Egbert, Larsson, and Biber (hereafter ELB 2020) and Designing and Evaluating Language Corpora (2022) by Egbert, Biber, and Gray (hereafter EBG 2022) provide extended analyses of the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC). A fundamental question in both books is how well different corpora (such as COCA and the BNC) "represent" the range of genres in the language. They focus in particular on academic English, but as we have shown elsewhere, their data is very problematic.

One genre that they ignore almost completely is texts from the Web. This is strange, since the Web is so much a part of our lives nowadays. To us, it seems rather bizarre to focus on rather minor, perceived issues of "representativity" in academic texts (and again, where the data does not support their claims), and then to ignore completely the "elephant in the room": BNC 1994 Written is missing an entire genre, namely texts from the Web. In addition, while some Web-based texts are available in BNC Written 2014 (released in 2021), the Web portion of that corpus is much, much smaller than what is available from COCA.

COCA

Before discussing the BNC, however, let's briefly discuss the Web texts in COCA. The COCA corpus has more than 250 million words of data from the Web (divided almost evenly between blogs and other texts from the Web), and the texts have been categorized very well into different web genres by Serge Sharoff. And in COCA, you can limit your searches and sort the results by any of these 22 web-based genres, such as Web-Reviews or Blog-Personal.

The importance of the web as a genre is exemplified by the following words, which are nouns that are at least three times as common in the Web portion of COCA as in the corpus as a whole:

browser, blog, blogging, spoiler, app, libertarian, functionality, font, download, progressive, trope, server, smartphone, desktop, email, spam, password, website, liberal, podcast, e-book, warranty, tablet, tweet, template, conservative, default, posting, tutorial, update, sequel, vegan, programmer, node, user, episode, attribute, navigation, ignorance, keyword, layout, stupidity, upgrade, timeline, stat, mobile, geek, web, troll, installment, specification, thread

And the following nouns are at least twice as common:

page, internet, file, character, vote, version, book, universe, article, argument, content, chapter, media, religion, opinion, climate, document, device, video, software, example, element, amount, author, option, faith, application, copy, description, income, search, title, error, list, person, account, section, site, topic, fact
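The "three times as common" and "twice as common" thresholds above compare normalized rates (occurrences per million words) rather than raw counts, since the Web portion is much smaller than the corpus as a whole. A minimal sketch of that comparison is shown here. The function names are hypothetical, and the counts are made-up placeholders for illustration, not figures taken from COCA.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000


def frequency_ratio(sub_count: int, sub_size: int,
                    total_count: int, total_size: int) -> float:
    """Ratio of a word's per-million rate in a sub-corpus to its
    per-million rate in the corpus as a whole."""
    if total_count == 0:
        raise ValueError("word does not occur in the corpus at all")
    return per_million(sub_count, sub_size) / per_million(total_count, total_size)


# Hypothetical counts for illustration only (not real COCA data):
# a word found 600 times in a 250-million-word Web section and
# 800 times in a 1-billion-word corpus overall.
ratio = frequency_ratio(600, 250_000_000, 800, 1_000_000_000)

# A word qualifies for the first list if the ratio is at least 3,
# and for the second list if it is at least 2.
print(round(ratio, 2))
```

With this kind of ratio, a value of 1 means the word is exactly as common in the Web texts as in the corpus overall, and larger values mean the word is more characteristic of the Web.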

BNC 1994

The latest texts in BNC 1994 are from 1993, which was (unfortunately for the corpus) right before the Web took off in the mid-1990s. To excuse the BNC for having no texts from the Web, some might say: "Well, the Web (and certainly blogs) weren't a thing in the 1980s and early 1990s when the BNC was being created, so why criticize it for not having something that wasn't really available in 1994?"

But that's just the point. The BNC did a very good job of representing the genres of 1990, but less so those of 1995 (when the Web was already becoming popular), or 2000 ... or 2005 (when blogs existed), or definitely 2010 ... or 2015 ... or 2020 (when the BNC Written update was still not available).

It makes absolutely no sense to quibble about presumed issues with COCA in terms of representativity (where the data shows that COCA is actually better than the BNC), and then ignore "the elephant in the room": the fact that while Egbert, Larsson, and Biber (2020) and Egbert, Biber, and Gray (2022) were being written, the BNC was completely missing material from an entire genre, and one that has been a huge part of people's language input for the previous 20-25 years. Why is it that those who claim to care so much about "representativity" have given the BNC a "pass" on this for several decades, with only the promise that at some point it would all be fixed? Why is there no discussion of this incredibly important "missing genre" in these books?

And even beyond this general issue of no Web / blog texts in the BNC, there is the issue of "missing lexis" in other genres (magazines, newspapers, etc.) for the last 20-30 years. Go ahead and search the BNC as a whole for words (such as those shown in the first list above) relating to technology or societal and cultural change since the 1990s. For example, 44 of the 52 words in that list are more common in COCA than in the BNC, and some words (such as blog, website, email, smartphone, download, tweet, posting, blogging, e-book, geek, timeline, app, browser) are at least 20 times as frequent (per million words) in COCA, or don't occur in the BNC at all.
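A cross-corpus comparison like this has one wrinkle: a word that never occurs in one corpus has a rate of zero there, so a simple ratio is undefined. The sketch below shows one way to handle that case. The helper name, the placeholder words, and the per-million rates are all hypothetical and are not data from COCA or the BNC.

```python
import math


def times_as_frequent(rate_a: float, rate_b: float) -> float:
    """How many times as frequent a word is in corpus A as in corpus B,
    given per-million rates. Returns infinity when the word occurs in A
    but is absent from B, and NaN when it is absent from both."""
    if rate_b == 0:
        return math.inf if rate_a > 0 else math.nan
    return rate_a / rate_b


# Hypothetical per-million rates, as (corpus A, corpus B) pairs.
rates = {
    "word_one": (12.0, 0.5),    # 24 times as frequent in A
    "word_two": (3.0, 0.0),     # absent from B
    "word_three": (1.0, 2.0),   # more frequent in B
}

more_common_in_a = [w for w, (a, b) in rates.items() if a > b]
at_least_20x = [w for w, (a, b) in rates.items()
                if times_as_frequent(a, b) >= 20]

print(more_common_in_a)
print(at_least_20x)
```

Treating "absent from B" as an infinite ratio means such words automatically count toward the "at least 20 times as frequent" group, which matches how the two cases are grouped together in the paragraph above.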

As long as researchers simply limit themselves to grammatical differences between genres (which is the approach of many "experts" in corpus design), there's no problem. But if they were to look at lexis (words) as well, then all of a sudden the BNC massively fails to be "representative" of the language during the last 20-30 years. And yes, words are part of the language too.

BNC 2014 (much more information available here)

When the 2014 update to the written portion of the BNC was being designed, there were (finally) plans to include texts from the Web. Unfortunately, things haven't really progressed as expected.

1. First, BNC 2014 Written wasn't released until 2021 -- seven years after the "2014" date in its title, and almost 30 years after BNC 1994.

2. The "Web texts" portion of BNC 2014 Written is very small -- less than 5 million words in size (see Section 1 here). COCA has about 50 times as much data (250 million words) from web texts. In fact, in BNC 2014 Written, there is more data from "annual business reports" than there is from Web texts. Assuming that a corpus should "represent" what people see and hear in the "real world", this design is a real problem.

3. BNC Written 2014 is only available via LancsBox, a piece of proprietary software that must be installed on your own machine. LancsBox X has progressed markedly from Versions 1 and 2 (released in 2021 and 2022), where it was impossible to find collocates, and where users were limited mostly to just seeing the word or phrase in concordance lines. The newest version (3.0) has some nice features (including GraphColl for collocates), but even when it is installed on a fairly robust computer, it is still much, much slower than COCA (which is 10 times the size of BNC 2014). (See Section 5 here)

4. While it is possible in LancsBox to limit searches to one of the eight "macro-registers" (Web, spoken, fiction, etc.), it is still not possible to limit searches to parts of the Web. In COCA, on the other hand, you can limit searches to one of the 22 categories that have been assigned to Web texts (such as Web-Reviews or Blog-Personal). In addition, in COCA you can quickly and easily create Virtual Corpora for specific websites or topics. Nothing like that is possible in LancsBox.

5. Perhaps most importantly, there is still (as of January 2024) limited information about what is in the Web portion of BNC 2014 Written. In the spreadsheet that lists the texts in the corpus, there are no links to the original texts online (for about 80% of the content from web texts), or even any title for any of the web pages. As Egbert, Larsson, and Biber (2020:8-14) note, one of the very first things that should be done when a corpus is released is to provide documentation on what is in the corpus. Detailed information about each and every text in COCA has been available since the day it was first released in 2008, and for every update since then (see the data for the 2019 update).

In summary, while BNC Written 2014 (released in 2021) may (finally) have some texts from the Web, its Web portion is still much smaller than COCA's (5 million words in the BNC, compared to 250 million in COCA). In addition, it can only be accessed on your own computer with the LancsBox program, which is much, much slower than COCA, and which does not allow you to search specific portions of the web texts. There is also relatively little information about the web texts in the BNC, compared to what is available for COCA.

English-Corpora.org