Download spreadsheet for all 485,179
texts (with summary by year, genre, and sub-genre)
The corpus is composed of more than one billion words in 485,202 texts, including 24-25 million words for each year from 1990 to 2019.
For each year (and therefore overall as well), the corpus is divided almost evenly among six genres: TV and movie subtitles, spoken, fiction, popular magazines, newspapers, and academic journals.
This is important because, if you want to compare different years, you need to be comparing "apples" to "apples" (i.e. the same genre balance in the different periods).
YEAR | BLOG | WEB | TV / MOVIES | SPOKEN | FICTION | MAGAZINE | NEWSPAPER | ACADEMIC | TOTAL |
TOTAL | 125,496,215 | 129,899,426 | 128,013,334 | 127,396,916 | 119,505,292 | 127,352,014 | 122,959,393 | 120,988,348 | 1,001,610,938 |
1990 | | | 3,207,900 | 4,374,469 | 4,162,242 | 4,101,447 | 4,082,931 | 3,983,143 | 23,912,132 |
1991 | | | 3,379,151 | 4,316,898 | 4,192,646 | 4,209,838 | 4,104,806 | 4,051,046 | 24,254,385 |
1992 | | | 3,183,858 | 4,523,054 | 3,893,956 | 4,288,694 | 4,092,031 | 4,028,147 | 24,009,740 |
1993 | | | 3,785,924 | 4,487,978 | 3,921,244 | 4,254,351 | 4,153,070 | 4,150,671 | 24,753,238 |
1994 | | | 4,375,338 | 4,457,726 | 3,870,757 | 4,310,375 | 4,147,947 | 4,047,115 | 25,209,258 |
1995 | | | 5,006,966 | 4,548,602 | 3,846,412 | 4,314,737 | 4,122,703 | 4,016,371 | 25,855,791 |
1996 | | | 4,384,976 | 4,095,266 | 3,758,787 | 4,338,766 | 4,099,305 | 4,110,209 | 24,787,309 |
1997 | | | 4,380,670 | 3,904,996 | 3,617,741 | 4,368,917 | 4,153,906 | 4,420,786 | 24,847,016 |
1998 | | | 4,390,197 | 4,446,217 | 3,779,801 | 4,393,835 | 4,122,295 | 4,111,453 | 25,243,798 |
1999 | | | 4,381,144 | 4,445,564 | 4,154,537 | 4,391,146 | 4,107,423 | 4,023,282 | 25,503,096 |
2000 | | | 4,385,593 | 4,455,815 | 3,942,474 | 4,387,935 | 4,037,086 | 4,093,991 | 25,302,894 |
2001 | | | 4,389,164 | 4,026,240 | 3,894,789 | 4,298,636 | 4,072,447 | 3,965,654 | 24,646,930 |
2002 | | | 4,384,475 | 4,372,290 | 3,766,673 | 4,310,634 | 4,114,280 | 4,054,359 | 25,002,711 |
2003 | | | 4,386,799 | 4,445,270 | 4,125,039 | 4,332,708 | 4,056,245 | 4,047,802 | 25,393,863 |
2004 | | | 4,378,535 | 4,359,084 | 4,099,691 | 4,337,309 | 4,121,117 | 4,009,359 | 25,305,095 |
2005 | | | 4,382,594 | 4,438,877 | 4,101,737 | 4,364,776 | 4,124,225 | 3,925,927 | 25,338,136 |
2006 | | | 4,369,684 | 4,345,995 | 4,113,173 | 4,302,713 | 4,120,732 | 4,019,200 | 25,271,497 |
2007 | | | 4,384,406 | 3,914,424 | 4,063,116 | 4,225,511 | 4,002,299 | 4,303,993 | 24,893,749 |
2008 | | | 4,376,702 | 3,467,315 | 4,147,216 | 4,289,641 | 4,021,006 | 3,977,790 | 24,279,670 |
2009 | | | 4,360,676 | 3,942,512 | 4,072,580 | 3,972,290 | 3,956,523 | 3,975,128 | 24,279,709 |
2010 | | | 4,386,795 | 4,097,760 | 3,897,459 | 3,832,576 | 4,226,666 | 3,838,637 | 24,279,893 |
2011 | | | 4,366,464 | 4,706,635 | 4,165,068 | 4,194,966 | 3,941,853 | 4,474,072 | 25,849,058 |
2012 | | | 4,379,595 | 4,411,281 | 3,862,889 | 4,306,912 | 4,126,669 | 4,384,263 | 25,471,609 |
2013 | | | 4,379,396 | 3,986,106 | 4,256,880 | 4,190,854 | 4,106,654 | 3,559,748 | 24,479,638 |
2014 | | | 4,380,134 | 3,850,683 | 4,172,260 | 4,264,503 | 4,140,151 | 3,476,429 | 24,284,160 |
2015 | | | 4,377,018 | 3,980,660 | 4,218,823 | 4,205,807 | 4,108,436 | 3,638,406 | 24,529,150 |
2016 | | | 4,380,381 | 4,168,303 | 3,258,473 | 4,053,156 | 4,059,857 | 3,968,779 | 23,888,949 |
2017 | | | 4,384,822 | 4,225,248 | 3,940,337 | 4,212,809 | 4,154,518 | 4,052,435 | 24,970,169 |
2018 | | | 4,353,912 | 4,300,990 | 4,109,362 | 4,143,311 | 4,158,845 | 4,200,047 | 25,266,467 |
2019 | | | 4,350,065 | 4,300,658 | 4,099,130 | 4,152,861 | 4,123,367 | 4,080,106 | 25,106,187 |
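The genre balance described above can be checked directly from the table. The following is a minimal sketch, not part of the corpus tooling: it copies the 1990 and 2019 rows by hand and computes each genre's share of that year's total. The names `GENRES`, `COUNTS`, and `genre_shares` are illustrative assumptions.

```python
# Word counts copied by hand from the 1990 and 2019 rows of the table above.
# Order: TV/Movies, Spoken, Fiction, Magazine, Newspaper, Academic.
GENRES = ["TV/Movies", "Spoken", "Fiction", "Magazine", "Newspaper", "Academic"]
COUNTS = {
    1990: [3_207_900, 4_374_469, 4_162_242, 4_101_447, 4_082_931, 3_983_143],
    2019: [4_350_065, 4_300_658, 4_099_130, 4_152_861, 4_123_367, 4_080_106],
}


def genre_shares(year):
    """Return each genre's share of the year's total words, in percent."""
    counts = COUNTS[year]
    total = sum(counts)
    return {g: round(100 * c / total, 1) for g, c in zip(GENRES, counts)}


for year in COUNTS:
    # The summed counts should match the TOTAL column for that year.
    print(year, sum(COUNTS[year]), genre_shares(year))
```

A perfectly even split across six genres would give each about 16.7%, so the printed shares show how close a given year comes to that balance.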
The texts come from a variety of sources:
-
TV/Movies subtitles: (128 million words
[128,013,334]). These come from the American part of the
TV
and
Movies corpora. These subtitles are
as informal as (or more informal than)
actual spoken data. The texts were taken from the
OpenSubtitles collection. In cases where there were multiple
subtitle files for a given TV episode (which was the norm), we used the
"highest ranked" file, in terms of accuracy (from the ratings at
OpenSubtitles). We then matched up each episode (and series) with
the corresponding page from
IMDB, which provides
rich metadata for these TV shows (and which can be used to create your
own Virtual Corpus).
-
Spoken: (127 million words
[127,396,916]) Transcripts of unscripted
conversation from more than 150 different TV and radio programs
(examples: All Things Considered (NPR), Newshour (PBS),
Good Morning America (ABC), Today Show (NBC), 60 Minutes
(CBS), Hannity and Colmes (Fox), Jerry Springer, etc).
(See notes on the naturalness and
authenticity of the language from these transcripts.)
-
Fiction: (120 million words
[119,505,292]) Short stories and plays
from literary magazines, children’s magazines, popular magazines, first
chapters of first edition books 1990-present, and movie scripts.
-
Popular Magazines: (127 million
words [127,352,014]) Nearly 100
different magazines, with a good mix (overall, and by year) between
specific domains (news, health, home and gardening, women, financial,
religion, sports, etc). A few examples are Time,
Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian
Century, Sports Illustrated, etc.
-
Newspapers: (123 million words
[122,959,393]) Ten newspapers from
across the US, including: USA Today, New York Times, Atlanta Journal
Constitution, San Francisco Chronicle, etc. In most cases, there is a good
mix between different sections of the newspaper, such as local news,
opinion, sports, financial, etc.
-
Academic Journals: (121 million words
[120,988,348]) Nearly 100
different peer-reviewed journals. These were selected to cover the
entire range of the Library of Congress classification system (e.g. a
certain percentage from B (philosophy, psychology, religion), D (world
history), L (education), T (technology), etc.), both overall and by
number of words per year.
-
Blogs: (125 million words
[125,496,215]). These texts represent a subset of the texts from the
United States in the GloWbE
corpus. When these texts were collected, Google allowed searches to be restricted to blogs,
so nearly all of these texts are actually blogs.
-
Web pages: (130 million words
[129,899,426]). They represent a subset of the "General" texts from the
United States in the GloWbE
corpus. Some of these texts are actually blogs (there was no way to
search "NOT blogs" in Google at that time).
-- More information is available on how these texts were
obtained from Google searches, and how the texts were identified as
being from the US.
-- The texts for both blogs and general web pages were subsequently
categorized by
Serge Sharoff, so that in COCA you can limit searches to a
particular web genre. (The full definitions of the genre categories are
at http://corpus.leeds.ac.uk/serge/webgenres/defs/annot-v6.pdf.)
-- Note that these texts were all collected in Oct 2012, so they are
more of a "snapshot" of this genre, rather than year by year (as above).
As a result, they are not included in the "historical" data, when you
compare the frequency across decades or years. All historical data comes
from the other six genres listed above.
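To make the last point concrete, here is a minimal sketch of how the "historical" portion of the corpus could be sized and how raw counts could be compared across years. It assumes that frequencies are normalized per million words, a common convention in corpus work that this page does not itself specify. The genre and year totals are copied from the table; the names `GENRE_TOTALS`, `YEAR_TOTALS`, `historical_words`, and `per_million`, and the hit count of 500, are hypothetical.

```python
# Genre totals copied from the TOTAL row of the table.
GENRE_TOTALS = {
    "blog": 125_496_215,
    "web": 129_899_426,
    "tv_movies": 128_013_334,
    "spoken": 127_396_916,
    "fiction": 119_505_292,
    "magazine": 127_352_014,
    "newspaper": 122_959_393,
    "academic": 120_988_348,
}
# Blogs and web pages are a snapshot, so they are left out of historical data.
SNAPSHOT_GENRES = {"blog", "web"}

# Yearly totals for the six historical genres, copied from two table rows.
YEAR_TOTALS = {1990: 23_912_132, 2019: 25_106_187}


def historical_words():
    """Sum the words in the six genres that have year-by-year data."""
    return sum(n for g, n in GENRE_TOTALS.items() if g not in SNAPSHOT_GENRES)


def per_million(raw_count, year):
    """Normalize a raw hit count by that year's corpus size (assumed convention)."""
    return raw_count / YEAR_TOTALS[year] * 1_000_000


print(historical_words())  # 746215297

hypothetical_hits = 500  # made-up count, for illustration only
for year in YEAR_TOTALS:
    print(year, round(per_million(hypothetical_hits, year), 2))
```

Because the yearly totals differ slightly, the same raw count corresponds to a slightly different normalized frequency in each year, which is why raw counts alone should not be compared across periods.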