COCA / BNC: Web texts
The recent books
Doing Linguistics with a Corpus (2020) by Egbert, Larsson, and Biber (hereafter ELB 2020)
and
Designing and Evaluating Language Corpora (2022) by Egbert,
Biber, and Gray (hereafter EBG 2022) provide extended
analyses of the Corpus of Contemporary American English (COCA) and
the British National Corpus (BNC). A fundamental question in both
books is how well different corpora (such as COCA and the BNC)
"represent" the range of genres in the language. They focus in
particular on academic English, but as we have shown
elsewhere, their data is very
problematic.
One genre that they
ignore almost completely is texts from the Web. This is
strange, since the Web is so much a part of our lives nowadays. To
us, it seems bizarre to focus on rather minor, perceived
issues of "representativity" in academic texts (and again,
where the data does not support their claims), and then to ignore
completely the "elephant in the room": that
BNC 1994 Written is missing an entire genre, namely texts from the Web.
In addition, while some Web-based texts are available
in BNC Written 2014 (released in 2021), that collection is much, much smaller
than what is available from COCA.
COCA
Before discussing the BNC, however, let's briefly discuss the Web
texts in COCA. The
COCA corpus has more than 250 million words of data from the Web
(divided almost evenly between blogs and other texts from the Web),
and the texts have been
categorized very well into different
web genres by
Serge Sharoff. And in COCA, you can limit your searches and
sort the results by any of these 22 web-based genres, such as Web-Reviews or Blog-Personal.
The importance of the web as a genre is
exemplified by the following words, which are nouns that are at
least three times as common in the Web portion of COCA as in the
corpus as a whole: browser, blog,
blogging, spoiler, app, libertarian, functionality, font, download,
progressive, trope, server, smartphone, desktop, email, spam,
password, website, liberal, podcast, e-book, warranty, tablet,
tweet, template, conservative, default, posting, tutorial, update,
sequel, vegan, programmer, node, user, episode, attribute,
navigation, ignorance, keyword, layout, stupidity, upgrade,
timeline, stat, mobile, geek, web, troll, installment,
specification, thread. And the
following nouns are at least twice as common:
page, internet, file, character, vote, version, book, universe,
article, argument, content, chapter, media, religion, opinion,
climate, document, device, video, software, example, element,
amount, author, option, faith, application, copy, description,
income, search, title, error, list, person, account, section,
site, topic, fact.
BNC 1994
The latest texts in BNC 1994 are from 1993, which was (unfortunately
for the corpus) right before the Web took off in the mid-1990s. In
terms of excusing the BNC for having no texts from the Web, some
might say that "well, the Web (and certainly blogs) weren't a thing
in the 1980s and early 1990s when the BNC was being created", so why
criticize it for not having something that wasn't really available
in 1994? But that's just the point. The
BNC did a very good job representing genres from 1990, but less so
in 1995 (when the Web was already becoming popular), or in 2000,
or in 2005 (when blogs existed), and certainly less so in 2010,
2015, or 2020 (when the BNC Written update was still not
available). It makes absolutely no sense
to quibble about presumed issues with COCA in terms of
representativity (and where the data
shows that COCA is actually better than the BNC), and then ignore "the elephant in the room": the fact
that, as
Egbert, Larsson, and Biber (2020) and
Egbert, Biber, and
Gray (2022) were being written, the BNC was completely missing
material from an entire genre, and one that has been a huge part of
people's language input for the previous 20-25 years. Why is it that
those who claim to care so much about "representativity" have given
the BNC a "pass" on this for several decades, with only the promise
that at some point it would all be fixed? Why is there no discussion
of this incredibly important "missing genre" in these books?
And even beyond this general issue of no Web / blog texts in the
BNC, there is the issue of "missing lexis" in other genres
(magazines, newspapers, etc.) for the last 20-30 years. Go ahead and
search the BNC as a whole for words relating to technology or to
societal and cultural change since the 1990s (such as those shown in
the first list above). Of the 52 words in that list, 44
are more common in COCA than in the BNC, and some words (such as
blog, website, email, smartphone, download, tweet, posting, blogging,
e-book, geek, timeline, app, browser) are at least 20 times as
frequent (per million words) in COCA, or don't occur in the BNC at
all.
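The per-million comparison used above can be sketched in a few lines of Python. This is only a minimal illustration, not how the COCA or BNC interfaces compute their figures: the function names are our own, and every count and corpus size in the usage lines is a hypothetical placeholder, not actual corpus data.

```python
from typing import Optional


def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def frequency_ratio(count_a: int, size_a: int,
                    count_b: int, size_b: int) -> Optional[float]:
    """Ratio of the per-million frequency in corpus A to that in corpus B.

    Returns float('inf') if the word occurs in A but never in B,
    and None if it occurs in neither corpus.
    """
    rate_a = per_million(count_a, size_a)
    rate_b = per_million(count_b, size_b)
    if rate_b == 0:
        return float("inf") if rate_a > 0 else None
    return rate_a / rate_b


# Hypothetical numbers, for illustration only (not real COCA/BNC counts):
# a word with 40,000 hits in a 1-billion-word corpus and 150 hits in a
# 100-million-word corpus.
ratio = frequency_ratio(40_000, 1_000_000_000, 150, 100_000_000)
print(f"{ratio:.1f} times as frequent per million words")
```

Normalizing to a per-million rate before dividing is what makes corpora of different sizes comparable; a word that is entirely absent from the second corpus is reported as an infinite ratio rather than raising a division error.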
As long as researchers simply limit themselves to
grammatical differences between genres (which is the approach of
many "experts" in
corpus design), there's no problem. But if they
were to look at lexis (words) as well, then all of a sudden the
BNC massively fails to be "representative" of the language during
the last 20-30 years. And yes, words are part of the language too.
BNC 2014 (much more information available here)
When the 2014 update to
the written portion of the BNC was being designed, there were
(finally) plans to include texts from the Web. Unfortunately, things
haven't really progressed as expected.
1. First, BNC 2014 Written wasn't released until 2021 --
seven years after the "2014" date in its title, and almost 30 years
after BNC 1994.
2. The "Web texts" portion of BNC 2014 Written is very small:
less than 5 million words in size (see Section 1
here). COCA has about 50 times as much data (250 million words)
from web texts. In fact, in BNC 2014 Written, there is more data
from "annual business reports" than there is from Web texts.
Assuming that a corpus should "represent" what people see and hear
in the "real world", this design is a real problem.
3. BNC Written 2014 is only available via LancsBox,
a piece of proprietary software that must be installed on your own
machine. LancsBox X has progressed markedly from Versions 1
and 2 (released in 2021 and 2022), where it was impossible to find collocates, and
where users were limited mostly to just seeing the word or phrase in
concordance lines. The newest version (3.0) has some nice features
(including GraphColl for collocates), but even when it is installed on
a fairly robust computer, it is still much, much slower than COCA
(which is 10 times the size of BNC 2014). (See Section 5
here.)
4. While it is possible in LancsBox to limit searches to one of the
eight "macro-registers" (Web, spoken, fiction, etc.), it is still
not possible to limit searches to parts of the Web. In COCA, on
the other hand, you can limit searches to one of the 22 categories
that have been assigned to Web texts (such as Web-Reviews or
Blog-Personal). In addition, in COCA you can quickly and easily create
Virtual Corpora for specific websites or topics. Nothing like that is
possible in LancsBox.
5. Perhaps most importantly, there is still (as of January 2024)
limited information about what is in the Web portion of BNC 2014
Written. In the spreadsheet that lists the texts in the corpus,
there are no links to the original texts online (for about 80% of
the content from web texts), or even any title for any of the web
pages. As
Egbert, Larsson, and
Biber (2020:8-14) note, one of the very first things that should be done
when a corpus is released is to provide documentation on what is in
the corpus. Detailed information about each and every text in COCA
has been available since the day it was first released in 2008, and
for every update since then (see the
data for the 2019 update).
In summary, while BNC Written 2014
(released in 2021) may (finally) have some texts from the Web, it is
still much smaller than COCA (5 million words in the BNC, compared
to 250 million in COCA). In addition, it can only be accessed on
your own computer with the LancsBox program, which is much, much
slower than COCA, and which does not allow you to search specific
portions of the web texts. There is also relatively little
information about the web texts in the BNC, compared to what is
available for COCA.