COCA / BNC: Web texts
The recent books
Doing Linguistics with a Corpus (2020) by Egbert, Larsson, and Biber (hereafter ELB 2020)
and
Designing and Evaluating Language Corpora (2022) by Egbert,
Biber, and Gray (hereafter EBG 2022) provide extended
analyses of the Corpus of Contemporary American English (COCA) and
the British National Corpus (BNC). A fundamental question in both
books is how well different corpora (such as COCA and the BNC)
"represent" the range of genres in the language. They focus in
particular on academic English, but as we have shown
elsewhere, their data is very
problematic.
One genre that they
ignore almost completely is texts from the Web. This is
strange, since the Web is so much a part of our lives nowadays. To
us, it seems bizarre to focus on relatively minor, perceived
issues of "representativity" in academic texts (and again,
where the data does not support their claims), and then to ignore
completely the "elephant in the room" -- that the publicly
available BNC is missing an entire genre -- texts from the Web.
On this page, we will discuss both BNC 1994 and BNC 2014 (which was
actually released -- to the extent that it has been "released" -- in 2021).
COCA
Before discussing the BNC, however, let's briefly discuss the Web
texts in COCA. The
COCA corpus has more than 250 million words of data from the Web
(divided almost evenly between blogs and other texts from the Web),
and the texts have been
categorized very well into different
web genres by
Serge Sharoff. The importance of the web as a genre is
exemplified by the following words, which are nouns that are at
least three times as common in the Web portion of COCA as in the
corpus as a whole: browser, blog,
blogging, spoiler, app, libertarian, functionality, font, download,
progressive, trope, server, smartphone, desktop, email, spam,
password, website, liberal, podcast, e-book, warranty, tablet,
tweet, template, conservative, default, posting, tutorial, update,
sequel, vegan, programmer, node, user, episode, attribute,
navigation, ignorance, keyword, layout, stupidity, upgrade,
timeline, stat, mobile, geek, web, troll, installment,
specification, thread.
And the following nouns are at least twice as common:
page, internet, file, character, vote, version, book, universe,
article, argument, content, chapter, media, religion, opinion,
climate, document, device, video, software, example, element,
amount, author, option, faith, application, copy, description,
income, search, title, error, list, person, account, section,
American, site, topic, fact.
It's hard to imagine a corpus that would have comparatively few tokens of these words.
Unfortunately, that is the case for the publicly-available BNC (which is BNC 1994).
BNC 1994
The latest texts in BNC 1994 are from 1993, which was (unfortunately
for the corpus) right before the Web took off in the mid-1990s. To
excuse the BNC for having no texts from the Web, some
might say "well, the Web (and certainly blogs) weren't a thing
in the 1980s and early 1990s when the BNC was being created, so why
criticize it for not having something that wasn't really available
in 1994?" But that's just the point. The
BNC did a very good job of representing the genres of 1990, but less so
those of 1995 (when the Web was already becoming popular), or of 2000 . . .
or of 2005 (when blogs existed), or definitely of 2010 . . . or
2015 . . . or 2020 (when the BNC Written update was still not
available). It makes absolutely no sense
to quibble about presumed issues with COCA in terms of
representativity (where the data
shows that COCA is actually better than the BNC), and then ignore "the elephant in the room": the fact
that, as
Egbert, Larsson, and Biber (2020) and
Egbert, Biber, and
Gray (2022) were being written, the BNC was completely missing
material from an entire genre, one that had been a huge part of
people's language input for the previous 20-25 years. Why is it that
those who claim to care so much about "representativity" have given
the BNC a "pass" on this for several decades -- with only the promise
that at some point it would all be fixed? Why is there no discussion
of this incredibly important "missing genre" in these books?
And even beyond this general issue of no Web / blog texts in the
BNC, there is the issue of "missing lexis" in other genres
(magazines, newspapers, etc.) for the last 20-30 years. Go ahead and
search these other genres in the BNC for words (such as those listed above) relating to technology or to societal and
cultural change since the 1990s -- they're just
not there, or they are much less common than in COCA.
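Comparisons such as "three times as common" or "much less common than in COCA" only make sense once raw counts are normalized by the size of each corpus (or corpus section). The following is a minimal Python sketch of that arithmetic. The function names and the counts are hypothetical and chosen purely for illustration; they are not real COCA or BNC figures, and this is not a description of how any particular corpus interface computes its numbers.

```python
def per_million(count, corpus_size):
    """Normalized frequency: occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def frequency_ratio(count_a, size_a, count_b, size_b):
    """How many times as common a word is in corpus A as in corpus B.

    Returns inf if the word occurs in A but never in B,
    and nan if it occurs in neither.
    """
    freq_a = per_million(count_a, size_a)
    freq_b = per_million(count_b, size_b)
    if freq_b == 0:
        return float("inf") if freq_a > 0 else float("nan")
    return freq_a / freq_b


# Hypothetical counts for illustration only (not real corpus data).
ratio = frequency_ratio(count_a=5_000, size_a=250_000_000,
                        count_b=400, size_b=100_000_000)
print(f"{ratio:.1f} times as common")
```

Note that both sizes are required inputs: without the word count of a section, a raw frequency from that section cannot be compared with anything.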
As long as researchers simply limit themselves to
grammatical differences between genres (which is the approach of
many "experts" in
corpus design), there's no problem. But if they
were to look at lexis (words) as well, then all of a sudden the
BNC massively fails to be "representative" of the language during
the last 20-30 years. And yes, words are part of the language too.
BNC 2014
When the 2014 update to
the written portion of the BNC was being designed, there were
(finally) plans to include texts from the Web. Unfortunately, things
haven't really progressed as expected.
1. First, BNC 2014 Written wasn't released until December 2021 --
seven years after the "2014" date in its title, and almost 30 years
after BNC 1994.
2. BNC 2014 Written is
only available via LancsBox, which is a fairly rudimentary piece of
software. Essentially the only thing that one can do with LancsBox
(including the LancsBox X update) is to search for a word or phrase,
and then see concordance lines. For example, it is completely
impossible to find the collocates of a word (which is normally a
pretty basic task for corpus software), which means that it's almost
impossible to look in a meaningful way at the meaning or usage of a
word -- other than manually slogging through individual concordance
lines.
3. There is absolutely no documentation on the size of the corpus by
sub-genre -- such as Academic-Medicine, Newspaper-Sports, or any of the
categories that have been assigned to Web texts in COCA -- and so no way to meaningfully
compare frequencies in different parts of the corpus.
4. Perhaps most importantly, there is still (as of November 2023)
no
publicly-available list of the texts in BNC 2014 Written. We simply
do not have any idea what is in the corpus. As
Egbert, Larsson, and
Biber (2020:8-14) note, one of the very first things that should be done
when a corpus is released is to provide documentation on what is in
the corpus. Detailed information about each and every text in COCA
has been available since the day it was first released in 2008, and
for every update since then (see the
data for the 2019 update).
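To make concrete what is meant by "collocates" in point 2 above, here is a minimal Python sketch that counts the words occurring within a fixed window around a node word. It is an illustration only: the function name, the window size, and the toy sentence are assumptions, and real corpus software typically ranks collocates with an association measure (such as Mutual Information) rather than raw counts.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    node = node.lower()
    tokens = [t.lower() for t in tokens]
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Span of positions around this occurrence, clipped to the text.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts


# Toy sentence for illustration only; not drawn from any corpus.
text = "the blog post linked to another blog post about the web"
counts = collocates(text.split(), "blog", window=2)
print(counts.most_common(3))
```

Doing the same thing by hand from concordance lines means reading and tallying every line, which is what makes the lack of a collocates function such a practical limitation.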
In summary, while BNC 2014 Written
(released in 2021) may (finally) have some texts from the Web, we
don't have any idea what those texts are, and we don't have any
meaningful way to search this corpus (since we are limited to the
rudimentary LancsBox software). So until BNC 2014 is really
available in a format in which we can examine what is in it, we will
refer to it as not yet truly "publicly available".