In March 2020, there were several important
changes, improvements, and additions to COCA. They include the
following:
1. New genres. COCA has been expanded from the existing five genres
of spoken, fiction, magazine, newspaper, and academic. The three new
genres are blogs, other web pages, and TV and movie subtitles.
1.1 TV and movie subtitles (64 million words from TV shows and 64 million
words from movies, 1990-2019), from the American part of the TV and
Movies corpora. As is discussed more fully at those corpora, these
subtitles are as informal as (or more informal than) actual spoken data,
and psycholinguistic research shows that TV and movie subtitles often
reflect speakers' intuitions about "frequent / core" vocabulary better
than actual spoken data does. In addition, all of the 24,000 TV and movie
texts are linked to their IMDB entries. This means that you can see rich
metadata for each of these texts (see examples from TV and movies), and
you can also create Virtual Corpora using this metadata (for example,
year, genre, or words in the plot description).
1.2 The blog data (125 million words, from the American portion of
GloWbE) is also very informal. Together with the other web pages (130
million words, also from the American part of GloWbE), it represents the
type of language that most people encounter every day as they surf the
Web, but which has not been represented in COCA until now. In addition,
all 188,000 of the web pages have been categorized using advanced Machine
Learning algorithms by Serge Sharoff (the full definitions of the genre
categories are in
http://corpus.leeds.ac.uk/serge/webgenres/defs/annot-v6.pdf), and you can
use that genre classification (promotion, instructional, review, etc.) as
you create Virtual Corpora.
1.3 Overall, these three new genres provide data on very informal,
everyday language, as well as "Internet language", both of which have
only partially been available in COCA before this point. No other large
corpus of English has such a wide range of genres.
2. Size. With 120-130 million words in each of the three new genres
(as well as in each of the five existing genres), COCA is now 1.0 billion words in size --
nearly twice as large as the previous version.
3. More recent texts. The corpus contains 32 million words of
data from 2018 and 32 million words from 2019 (up through December
2019), with 4 million words in each of the 8 genres each year. No other
genre-balanced corpus of English has such up-to-date texts.
4. Extensive information for the top 60,000 words. As with iWeb (see
pages 4-8 in the overview), there is extensive information for each of
the top 60,000 words (lemmas) in the corpus:
- pronunciation, images, and videos (from YouGlish, Google images, etc.)
  for each word
- translation of the word into 50-60 different languages, from Google
  Translate (or Baidu translate in China)
- frequency data: frequency in each of the 8 genres, range (the
  percentage of the 500,000+ texts in which the word appears), and
  dispersion (how evenly the word is "spread" across the corpus)
- related words (e.g. wisdom, wizard, wisely, and wisecrack for wise) and
  the frequency of each form of the lemma (e.g. break, breaks, broke,
  broken, and breaking for break)
- synonyms and (from WordNet) hypernyms (words with a more general
  meaning) and hyponyms (words with a more specific meaning)
- collocates: grouped by part of speech (unlike the current COCA), like
  the collocates (from "word sketches") in Sketch Engine
- topics: other words that co-occur (anywhere in the 500,000+ texts). In
  many cases, these provide even better insight into word meaning than
  collocates do
- clusters: the top 2-, 3-, and 4-word strings for a given word (which
  you can filter by how "meaningful" the strings are)
- concordance lines: 1000 sample lines for each word, color-coded for the
  part of speech of surrounding words
- a "home page" for each of the 60,000 words, which includes abbreviated
  information from the preceding items and links to the full pages
  discussed above
- FAST: all of the preceding data is "pre-processed", meaning that you
  can see a wealth of information for each word in less than one second
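
As a rough illustration of the "range" and "dispersion" figures listed
above, the following Python sketch computes both for a toy set of texts.
The page does not say which dispersion formula COCA uses; Juilland's D is
used here purely as an assumed stand-in. The function names and sample
texts are hypothetical.

```python
import math

def text_range(word, texts):
    """Percentage of texts in which `word` appears at least once."""
    hits = sum(1 for tokens in texts if word in tokens)
    return 100.0 * hits / len(texts)

def juilland_d(word, texts):
    """Juilland's D over per-text relative frequencies.

    1.0 means the word is spread perfectly evenly across the texts;
    values near 0 mean it is concentrated in very few of them.
    This is an assumed measure, not necessarily the one COCA uses.
    """
    if len(texts) < 2:
        raise ValueError("need at least two texts to measure dispersion")
    # Relative frequency of the word in each text (count / text length).
    rates = [tokens.count(word) / len(tokens) for tokens in texts]
    mean = sum(rates) / len(rates)
    if mean == 0:
        return 0.0  # the word never occurs
    # Population standard deviation of the per-text rates.
    sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / len(rates))
    return 1.0 - (sd / mean) / math.sqrt(len(rates) - 1)

# Hypothetical toy "corpus": four short texts, already tokenized.
texts = [
    "the wise owl saw the wise fox".split(),
    "a wise choice is a good choice".split(),
    "the fox ran far from the owl".split(),
    "no one here said a word today".split(),
]

print("range of 'wise' (%):", text_range("wise", texts))
print("dispersion of 'wise':", round(juilland_d("wise", texts), 3))
```

A word that occurs in only one text gets a dispersion close to 0, while a
word that occurs at the same rate in every text gets 1.0.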
5. Browse and search. As in iWeb (see page 3 in the overview), you can
browse through and search the top 60,000 words -- by word form, part of
speech, frequency, and even pronunciation -- and then (for each word in
the list) link to the information in #4 above.
6. Downloadable data. In addition, there are new full-text, word
frequency, collocates, and n-grams data from the new version of COCA.
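
To make the n-grams data (and the "clusters" in #4) more concrete, here is
a minimal Python sketch that counts the 2-, 3-, and 4-word strings
containing a given word. It works on a toy token list rather than the
actual COCA download format, which is not described here, and the
function name `clusters` is hypothetical.

```python
from collections import Counter

def clusters(tokens, word, n):
    """Count every n-word string in `tokens` that contains `word`."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if word in gram:
            counts[" ".join(gram)] += 1
    return counts

# Hypothetical sample text, already tokenized.
tokens = "take a break and then break the ice before the break".split()

for n in (2, 3, 4):
    # Show the two most frequent n-word strings around "break".
    print(n, clusters(tokens, "break", n).most_common(2))
```

On real corpus data, the same counts would be computed over millions of
tokens and then ranked by frequency.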