In March 2020, there were several important changes, improvements, and additions to COCA. They include the following:

1. New genres. COCA has been expanded beyond the existing five genres (spoken, fiction, magazine, newspaper, and academic). The three new genres are blogs, other web pages, and TV and movie subtitles.

1.1 TV and movie subtitles (64 million words from TV shows and 64 million words from movies, 1990-2019), from the American part of the TV and Movies corpora. As is discussed more fully at those corpora, these subtitles are as informal as (or more informal than) actual spoken data, and psycholinguistic research shows that TV and movie subtitles often reflect speakers' intuitions about "frequent / core" vocabulary better than actual spoken data does. In addition, all of the 24,000 TV and movie texts are linked to their IMDB entries. This means that you can see rich metadata for each of these texts (see examples from TV and movies), and you can also create Virtual Corpora using this metadata (for example, year, genre, or words in the plot description).

1.2 The blog data (125 million words, from the American portion of GloWbE) is also very informal. Together with the other web pages (130 million words, also from the American part of GloWbE), it represents the type of language that most people encounter every day as they surf the Web, but which was not represented in COCA before this update. In addition, all 188,000 of the web pages have been categorized by Serge Sharoff using advanced machine learning algorithms (the full definitions of the genre categories are in http://corpus.leeds.ac.uk/serge/webgenres/defs/annot-v6.pdf), and you can use that genre classification (promotion, instructional, review, etc.) as you create Virtual Corpora.

1.3 Overall, these three new genres provide data on very informal, everyday language, as well as "Internet language", both of which have been only partially available in COCA before this point. No other large corpus of English has such a wide range of genres.

2. Size. With 120-130 million words in each of the three new genres (as well as in each of the five existing genres), COCA is now 1.0 billion words in size -- nearly twice as large as the previous version.

3. More recent texts. The corpus contains 32 million words of data from 2018 and 32 million words from 2019 (up through December 2019), with 4 million words in each of the 8 genres each year. No other genre-balanced corpus of English has such up-to-date texts.
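
The word counts quoted in items 1-3 can be cross-checked with simple arithmetic. The following is a minimal Python sketch that uses only the figures given above; the variable names are illustrative, and the 120-130 million range is applied to all eight genres, as item 2 states.

```python
# Figures quoted in the text above (all counts are in words).
MILLION = 1_000_000

# New genres: TV and movie subtitles are 64M (TV shows) + 64M (movies).
new_genres = {
    "tv_movie_subtitles": 64 * MILLION + 64 * MILLION,
    "blogs": 125 * MILLION,
    "other_web_pages": 130 * MILLION,
}

# Each new genre falls inside the stated 120-130 million range.
for name, size in new_genres.items():
    assert 120 * MILLION <= size <= 130 * MILLION, name

# Eight genres at 120-130 million words each bracket the 1.0 billion total.
n_genres = 5 + 3
total_low = n_genres * 120 * MILLION
total_high = n_genres * 130 * MILLION
print(f"{total_low:,} to {total_high:,}")    # 960,000,000 to 1,040,000,000

# Recent texts: 4 million words per genre per year.
words_per_year = n_genres * 4 * MILLION
print(f"{words_per_year:,} words per year")  # 32,000,000 words per year
```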

4. Extensive information for the top 60,000 words. As with iWeb (see pages 4-8 in the overview), there is extensive information for each of the top 60,000 words (lemmas) in the corpus:

  • pronunciation, images, and videos (from YouGlish, Google images, etc) for each word

  • translation of the word into 50-60 different languages, from Google Translate (or Baidu Translate in China)

  • frequency data: frequency in each of the 8 genres, range (the percentage of the 500,000+ texts in which the word appears), and dispersion (how evenly the word is "spread" across the corpus)

  • related words (e.g. wisdom, wizard, wisely, and wisecrack for wise) and the frequency of each form of the lemma (e.g. break, breaks, broke, broken, and breaking for break)

  • synonyms and (from WordNet) hypernyms (words with a more general meaning) and hyponyms (words with a more specific meaning)

  • collocates: grouped by part of speech (unlike in the previous version of COCA), similar to the collocates (from "word sketches") in Sketch Engine

  • topics: other words that co-occur (anywhere in the 500,000+ texts). In many cases, these provide even better insight into word meaning than collocates

  • clusters: the top 2-, 3-, and 4-word strings for a given word (you can also filter by how "meaningful" the strings are)

  • concordance lines: 1,000 sample lines for each word, color-coded for part of speech of surrounding words

  • a "home page" for each of the 60,000 words, which includes abbreviated information from the preceding items and links to the full pages discussed above

  • FAST: all of the preceding data is "pre-processed", meaning that you can see a wealth of information for each word in less than one second
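
To make the "range" and "dispersion" figures in the list above more concrete, here is a minimal Python sketch on a toy corpus. The text does not say which dispersion measure COCA uses, so Juilland's D is used here purely as an assumption; the function names `text_range` and `juilland_d` and the sample texts are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def text_range(word, texts):
    """Percentage of texts that contain the word at least once."""
    containing = sum(1 for tokens in texts if word in tokens)
    return 100.0 * containing / len(texts)


def juilland_d(word, parts):
    """Juilland's D (assumed measure): 1.0 = perfectly even, 0.0 = all in one part.

    Uses per-part relative frequencies so that parts of unequal size
    are comparable. Requires at least two non-empty parts.
    """
    if len(parts) < 2:
        raise ValueError("need at least two parts")
    rates = [tokens.count(word) / len(tokens) for tokens in parts]
    avg = mean(rates)
    if avg == 0:
        return 0.0  # the word never occurs
    variation = pstdev(rates) / avg  # coefficient of variation
    return 1.0 - variation / sqrt(len(parts) - 1)


# Toy corpus: each text is a list of tokens (invented for illustration).
texts = [
    "the wise owl spoke".split(),
    "a wise choice is wise".split(),
    "the owl slept".split(),
    "nothing wise here".split(),
]

print(text_range("wise", texts))  # 75.0 (3 of the 4 texts contain "wise")
print(round(juilland_d("wise", texts), 3))
```

A word that occurs at the same rate in every part scores 1.0, and a word confined to a single part scores 0.0, which matches the informal description of how evenly a word is "spread" across the corpus.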

5. Browse and search. As in iWeb (see page 3 in the overview), you can browse through and search the top 60,000 words by word form, part of speech, frequency, and even pronunciation, and then (for each word in the list) follow a link to the information in #4 above.

6. Downloadable data. There are also new full-text, word frequency, collocates, and n-grams data from the new version of COCA.
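
As a rough illustration of what n-grams data (and the 2-, 3-, and 4-word "clusters" in #4) look like, the following Python sketch counts the word strings that contain a given word. It is a toy example under stated assumptions: the function name `clusters` and the sample sentence are hypothetical, and it says nothing about how the downloadable COCA files are actually built or formatted.

```python
from collections import Counter


def clusters(tokens, word, n):
    """Count every n-word string in `tokens` that contains `word`."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if word in gram:
            counts[gram] += 1
    return counts


# Invented sample text, already split into tokens.
tokens = "break the ice and break the news then break the ice".split()

# Most frequent 2-, 3-, and 4-word strings containing "break".
for n in (2, 3, 4):
    top = clusters(tokens, "break", n).most_common(1)
    print(n, top)
```

On this sample, the most frequent 2-word string is "break the" (3 occurrences) and the most frequent 3-word string is "break the ice" (2 occurrences).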