COCA / BNC: Web texts

The recent books Doing Linguistics with a Corpus (2020) by Egbert, Larsson, and Biber (hereafter ELB 2020) and Designing and Evaluating Language Corpora (2022) by Egbert, Biber, and Gray (hereafter EBG 2022) provide extended analyses of the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC). A fundamental question in both books is how well different corpora (such as COCA and the BNC) "represent" the range of genres in the language. They focus in particular on academic English, but as we have shown elsewhere, their data is very problematic.

One genre that they ignore almost completely is texts from the Web. This is strange, since the Web is so much a part of our lives nowadays. To us, it seems bizarre to focus on minor, perceived issues of "representativity" in academic texts (where, again, the data does not support their claims), and then to ignore completely the "elephant in the room": the publicly available BNC is missing an entire genre, namely texts from the Web. On this page, we will discuss both BNC 1994 and BNC 2014 (which was actually released -- to the extent that it has been "released" -- in 2021).


Before discussing the BNC, however, let's briefly look at the Web texts in COCA. COCA has more than 250 million words of data from the Web (divided almost evenly between blogs and other Web texts), and the texts have been categorized very well into different Web genres by Serge Sharoff. The importance of the Web as a genre is illustrated by the following words, which are nouns that are at least three times as common in the Web portion of COCA as in the corpus as a whole:

browser, blog, blogging, spoiler, app, libertarian, functionality, font, download, progressive, trope, server, smartphone, desktop, email, spam, password, website, liberal, podcast, e-book, warranty, tablet, tweet, template, conservative, default, posting, tutorial, update, sequel, vegan, programmer, node, user, episode, attribute, navigation, ignorance, keyword, layout, stupidity, upgrade, timeline, stat, mobile, geek, web, troll, installment, specification, thread

And the following nouns are at least twice as common:

page, internet, file, character, vote, version, book, universe, article, argument, content, chapter, media, religion, opinion, climate, document, device, video, software, example, element, amount, author, option, faith, application, copy, description, income, search, title, error, list, person, account, section, American, site, topic, fact

It's hard to imagine a corpus of contemporary English that would have comparatively few tokens of these words. Unfortunately, that is the case for the publicly available BNC (which is BNC 1994).
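The "three times as common" and "twice as common" thresholds used for the word lists above can be sketched as a ratio of normalized frequencies: the rate of a word in one section of a corpus divided by its rate in the corpus as a whole. The following is a minimal illustration in Python, not the actual procedure used to build the lists. The function name, the counts, and the corpus sizes are all hypothetical and invented for the example; they are not real COCA figures.

```python
def overrepresented(section_counts, section_size, total_counts, total_size, min_ratio):
    """Return {word: ratio} for words whose relative frequency in one section
    is at least min_ratio times their relative frequency in the whole corpus."""
    result = {}
    for word, count in section_counts.items():
        total = total_counts.get(word, 0)
        if total == 0:
            continue  # word not found in the whole-corpus counts; skip it
        # (count / section_size) / (total / total_size), cross-multiplied
        # so that the comparison is not affected by intermediate rounding.
        ratio = (count * total_size) / (total * section_size)
        if ratio >= min_ratio:
            result[word] = round(ratio, 2)
    return result


# Hypothetical counts, invented for illustration only (not real COCA data).
web_counts = {"blog": 900, "page": 500, "house": 100}    # tokens in the Web section
all_counts = {"blog": 1000, "page": 1000, "house": 500}  # tokens in the whole corpus
WEB_SIZE = 250_000      # hypothetical size of the Web section, in words
TOTAL_SIZE = 1_000_000  # hypothetical size of the whole corpus, in words

three_times = overrepresented(web_counts, WEB_SIZE, all_counts, TOTAL_SIZE, 3.0)
twice = overrepresented(web_counts, WEB_SIZE, all_counts, TOTAL_SIZE, 2.0)
print(three_times)
print(twice)
```

The point of the normalization is that raw counts alone say nothing: a section that makes up a quarter of a corpus would be expected to hold about a quarter of the tokens of any word, so only a ratio that takes section size into account shows which words are characteristic of that section.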

BNC 1994

The latest texts in BNC 1994 are from 1993, which was (unfortunately for the corpus) right before the Web took off in the mid-1990s. To excuse the BNC for having no texts from the Web, some might say, "Well, the Web (and certainly blogs) wasn't a thing in the 1980s and early 1990s when the BNC was being created, so why criticize it for not having something that wasn't really available in 1994?"

But that's just the point. The BNC did a very good job of representing the genres of 1990, but it does less well for 1995 (when the Web was already becoming popular), or 2000, or 2005 (when blogs existed), and definitely for 2010, 2015, or 2020 (when the BNC Written update was still not available).

It makes absolutely no sense to quibble about presumed issues with COCA in terms of representativity (where the data shows that COCA is actually better than the BNC), and then ignore "the elephant in the room": the fact that, as Egbert, Larsson, and Biber (2020) and Egbert, Biber, and Gray (2022) were being written, the BNC was completely missing material from an entire genre, and one that had been a huge part of people's language input for the previous 20-25 years. Why is it that those who claim to care so much about "representativity" have given the BNC a "pass" on this for several decades -- with only the promise that at some point it would all be fixed? Why is there no discussion of this incredibly important "missing genre" in these books?

And even beyond this general issue of no Web / blog texts in the BNC, there is the issue of "missing lexis" in other genres (magazines, newspapers, etc.) for the last 20-30 years. Go ahead and search the BNC for words (such as those listed above) relating to technology or societal and cultural change since the 1990s in these other genres -- they're either not there, or they are much less common than in COCA.

As long as researchers limit themselves to grammatical differences between genres (which is the approach of many "experts" in corpus design), there's no problem. But if they were to look at lexis (words) as well, then all of a sudden the BNC massively fails to be "representative" of the language during the last 20-30 years. And yes, words are part of the language too.

BNC 2014

When the 2014 update to the written portion of the BNC was being designed, there were (finally) plans to include texts from the Web. Unfortunately, things haven't really progressed as expected.

1. First, BNC 2014 Written wasn't released until December 2021 -- seven years after the "2014" date in its title, and almost 30 years after BNC 1994.

2. BNC 2014 Written is only available via LancsBox, which is a fairly rudimentary piece of software. Essentially the only thing that one can do with LancsBox (including the LancsBox X update) is to search for a word or phrase, and then see concordance lines. For example, it is impossible to find the collocates of a word (which is normally a pretty basic task for corpus software), which means that it's almost impossible to look in a meaningful way at the meaning or usage of a word -- other than by manually slogging through individual concordance lines.

3. There is absolutely no documentation on the size of the corpus by sub-genre -- such as Academic-Medicine, Newspaper-Sports, or any of the categories that have been assigned to Web texts in COCA -- and so there is no way to meaningfully compare frequencies in different parts of the corpus.

4. Perhaps most importantly, there is still (as of November 2023) no publicly available list of the texts in BNC 2014 Written. We simply do not have any idea what is in the corpus. As Egbert, Larsson, and Biber (2020:8-14) note, one of the very first things that should be done when a corpus is released is to provide documentation on what is in the corpus. Detailed information about each and every text in COCA has been available since the day it was first released in 2008, and for every update since then (see the data for the 2019 update).
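To make the second point above concrete: in its simplest form, a collocate search just counts the words that occur within a few tokens of a "node" word. The following is a minimal sketch in Python, assuming a corpus that is already tokenized into a list of lowercase words. The function name, the window size, and the sample sentence are hypothetical and invented for the example; this is not a description of how any particular corpus tool implements collocates.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words that occur within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]    # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]   # up to `span` tokens after the node
        counts.update(left + right)
    return counts


# Tiny invented sample, for illustration only.
text = "the web browser crashed so i installed a new browser and the new browser works"
tokens = text.split()

top = collocates(tokens, "browser", span=2)
print(top.most_common(3))
```

Raw co-occurrence counts like these are dominated by very frequent words such as "the", which is why corpus software typically ranks collocates by an association measure (for example, Mutual Information) rather than by raw frequency. That ranking step needs the overall frequency of each word in the corpus, and it is left out of this sketch.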

In summary, while BNC 2014 Written (released in 2021) may (finally) have some texts from the Web, we don't have any idea what those texts are, and we don't have any meaningful way to search this corpus (since we are limited to the rudimentary LancsBox software). So until BNC 2014 is available in a format in which we can examine what is in it, we will refer to it as not yet truly "publicly available".