Download spreadsheet for all 485,179 texts (with summary by year, genre, and sub-genre)


The corpus is composed of more than one billion words in 485,202 texts, with 24-25 million words from each year from 1990 to 2019. For each year (and therefore overall as well), the corpus is evenly divided among the genres of TV and Movies subtitles, spoken, fiction, popular magazines, newspapers, and academic journals. This matters because comparisons between different years are only meaningful if you are comparing "apples" to "apples", i.e. the same genre balance in each period (a sketch of the per-million-words normalization used for such comparisons follows the table below).

YEAR   BLOG         WEB          TV/MOVIES    SPOKEN       FICTION      MAGAZINE     NEWSPAPER    ACADEMIC     TOTAL
TOTAL  125,496,215  129,899,426  128,013,334  127,396,916  119,505,292  127,352,014  122,959,393  120,988,348  1,001,610,938

(The blog and web texts are not broken down by year; see the notes below. The year-by-year figures for the other six genres are:)

YEAR  TV/MOVIES  SPOKEN     FICTION    MAGAZINE   NEWSPAPER  ACADEMIC   TOTAL
1990  3,207,900  4,374,469  4,162,242  4,101,447  4,082,931  3,983,143  23,912,132
1991  3,379,151  4,316,898  4,192,646  4,209,838  4,104,806  4,051,046  24,254,385
1992  3,183,858  4,523,054  3,893,956  4,288,694  4,092,031  4,028,147  24,009,740
1993  3,785,924  4,487,978  3,921,244  4,254,351  4,153,070  4,150,671  24,753,238
1994  4,375,338  4,457,726  3,870,757  4,310,375  4,147,947  4,047,115  25,209,258
1995  5,006,966  4,548,602  3,846,412  4,314,737  4,122,703  4,016,371  25,855,791
1996  4,384,976  4,095,266  3,758,787  4,338,766  4,099,305  4,110,209  24,787,309
1997  4,380,670  3,904,996  3,617,741  4,368,917  4,153,906  4,420,786  24,847,016
1998  4,390,197  4,446,217  3,779,801  4,393,835  4,122,295  4,111,453  25,243,798
1999  4,381,144  4,445,564  4,154,537  4,391,146  4,107,423  4,023,282  25,503,096
2000  4,385,593  4,455,815  3,942,474  4,387,935  4,037,086  4,093,991  25,302,894
2001  4,389,164  4,026,240  3,894,789  4,298,636  4,072,447  3,965,654  24,646,930
2002  4,384,475  4,372,290  3,766,673  4,310,634  4,114,280  4,054,359  25,002,711
2003  4,386,799  4,445,270  4,125,039  4,332,708  4,056,245  4,047,802  25,393,863
2004  4,378,535  4,359,084  4,099,691  4,337,309  4,121,117  4,009,359  25,305,095
2005  4,382,594  4,438,877  4,101,737  4,364,776  4,124,225  3,925,927  25,338,136
2006  4,369,684  4,345,995  4,113,173  4,302,713  4,120,732  4,019,200  25,271,497
2007  4,384,406  3,914,424  4,063,116  4,225,511  4,002,299  4,303,993  24,893,749
2008  4,376,702  3,467,315  4,147,216  4,289,641  4,021,006  3,977,790  24,279,670
2009  4,360,676  3,942,512  4,072,580  3,972,290  3,956,523  3,975,128  24,279,709
2010  4,386,795  4,097,760  3,897,459  3,832,576  4,226,666  3,838,637  24,279,893
2011  4,366,464  4,706,635  4,165,068  4,194,966  3,941,853  4,474,072  25,849,058
2012  4,379,595  4,411,281  3,862,889  4,306,912  4,126,669  4,384,263  25,471,609
2013  4,379,396  3,986,106  4,256,880  4,190,854  4,106,654  3,559,748  24,479,638
2014  4,380,134  3,850,683  4,172,260  4,264,503  4,140,151  3,476,429  24,284,160
2015  4,377,018  3,980,660  4,218,823  4,205,807  4,108,436  3,638,406  24,529,150
2016  4,380,381  4,168,303  3,258,473  4,053,156  4,059,857  3,968,779  23,888,949
2017  4,384,822  4,225,248  3,940,337  4,212,809  4,154,518  4,052,435  24,970,169
2018  4,353,912  4,300,990  4,109,362  4,143,311  4,158,845  4,200,047  25,266,467
2019  4,350,065  4,300,658  4,099,130  4,152,861  4,123,367  4,080,106  25,106,187
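
Because the total number of words differs slightly from year to year (and from genre to genre), raw frequency counts are normally converted to a rate per million words before being compared. The short Python sketch below is illustrative only: the hit counts are hypothetical, and only a few of the yearly totals from the table above are included. It simply shows the arithmetic behind such comparisons.

    # Illustrative only: convert raw hit counts into a per-million-words rate,
    # using yearly corpus sizes taken from the TOTAL column above, so that
    # counts from years of slightly different sizes can be compared directly.

    YEAR_TOTALS = {          # words per year (a few rows from the table above)
        1990: 23_912_132,
        2000: 25_302_894,
        2010: 24_279_893,
        2019: 25_106_187,
    }

    def per_million(raw_count: int, year: int) -> float:
        """Normalize a raw count for a given year to hits per million words."""
        return raw_count * 1_000_000 / YEAR_TOTALS[year]

    # Hypothetical counts for some word: 240 hits in 1990 and 250 hits in 2019
    # work out to nearly identical rates once year size is taken into account.
    print(round(per_million(240, 1990), 2))   # ~10.04 per million words
    print(round(per_million(250, 2019), 2))   # ~9.96 per million words

The same normalization applies when comparing genres, since the genre sections also differ slightly in size (roughly 120-130 million words each).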

The texts come from a variety of sources:

  • TV/Movies subtitles: (128 million words [128,013,334]). These come from the American part of the TV and Movies corpora. The subtitles are as informal as (or even more informal than) actual spoken data. The texts were taken from the OpenSubtitles collection. In cases where there were multiple subtitle files for a given TV episode (which was the norm), we used the "highest ranked" file in terms of accuracy (based on the ratings at OpenSubtitles). We then matched each episode (and series) with the corresponding page from IMDB, which provides rich metadata for these TV shows (and which can be used to create your own Virtual Corpus). A minimal sketch of this kind of selection-and-matching step appears at the end of this section.

  • Spoken: (127 million words [127,396,916]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See the notes on the naturalness and authenticity of the language from these transcripts.)

  • Fiction: (120 million words [119,505,292]) Short stories and plays from literary, children's, and popular magazines; first chapters of first-edition books from 1990 to the present; and movie scripts.

  • Popular Magazines: (127 million words [127,352,014]) Nearly 100 different magazines, with a good mix (overall and by year) across specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). Examples include Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.

  • Newspapers: (123 million words [122,959,393]) Ten newspapers from across the US, including USA Today, the New York Times, the Atlanta Journal-Constitution, and the San Francisco Chronicle. In most cases, there is a good mix of different sections of the newspaper, such as local news, opinion, sports, financial, etc.

  • Academic Journals: (121 million words [120,988,348]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year.

  • Blogs: (125 million words [125,496,215]). These texts are a subset of the texts from the United States in the GloWbE corpus. At the time those texts were collected, Google allowed searches to be restricted to blogs, so nearly all of these texts are actually blogs.

  • Web pages: (130 million words [129,899,426]). These texts are a subset of the "General" texts from the United States in the GloWbE corpus. Some of them are actually blogs (there was no way to search for "NOT blogs" in Google at the time).
    -- More information is available on how these texts were obtained from Google searches and on how they were identified as being from the US.
    -- Both the blog and the general web-page texts were subsequently categorized by Serge Sharoff, so that in COCA you can limit searches to a particular web genre. (The full definitions of the genre categories are at http://corpus.leeds.ac.uk/serge/webgenres/defs/annot-v6.pdf.)
    -- Note that these texts were all collected in October 2012, so they are more of a "snapshot" of this genre than a year-by-year sample (as above). As a result, they are not included in the "historical" data when you compare frequencies across decades or years; all historical data comes from the other six genres listed above.
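
As a minimal illustration of the subtitle-selection step described under "TV/Movies subtitles" above (this is not the actual pipeline; the episode IDs, ratings, and metadata below are hypothetical), the following sketch keeps only the highest-rated subtitle file for each episode and then attaches metadata keyed by a shared episode identifier:

    # Minimal sketch, not the actual COCA pipeline: pick the highest-rated
    # subtitle file per episode, then join it with metadata for that episode.
    from dataclasses import dataclass

    @dataclass
    class SubtitleFile:
        episode_id: str   # hypothetical ID shared with the metadata table
        rating: float     # accuracy rating, as on OpenSubtitles
        text: str         # the subtitle text itself

    candidates = [        # several competing files for the same episodes
        SubtitleFile("series-A:s1e1", 4.2, "..."),
        SubtitleFile("series-A:s1e1", 4.8, "..."),
        SubtitleFile("series-A:s1e2", 4.5, "..."),
    ]

    metadata = {          # hypothetical IMDB-style records, keyed by episode ID
        "series-A:s1e1": {"series": "Series A", "year": 2008, "genre": "Drama"},
        "series-A:s1e2": {"series": "Series A", "year": 2008, "genre": "Drama"},
    }

    # Keep only the highest-rated subtitle file for each episode.
    best = {}
    for sub in candidates:
        if sub.episode_id not in best or sub.rating > best[sub.episode_id].rating:
            best[sub.episode_id] = sub

    # Attach the metadata record to each selected file.
    corpus = [{"text": sub.text, **metadata[sub.episode_id]} for sub in best.values()]
    print(len(corpus))    # 2 episodes, each with one subtitle file plus metadata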