Download detailed list of all 140,703 texts

The corpus is composed of more than 475 million words of text in more than 100,000 individual texts. The major sources for each genre are as follows:

TV/Movies	From Open Subtitles
Fiction	Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2019)
Magazine	Making of America (1810-1900), scanned and PDF sources (1900-1990), COCA (1990-2019), NOW (2010-2019). In each decade, the magazine texts are balanced across at least ten magazines (with equivalent sub-genres for the 1900s)
Newspaper	Text converted from PDF in historical newspaper archives (1850-1980), COCA etc. (1990-2019), NOW (2010-2019)
Non-fiction	Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010). In each decade, the non-fiction is balanced across the Library of Congress classification system

The corpus is balanced by genre across the decades. For example, fiction + TV/Movies (which are scripted, and so similar to fiction) accounts for roughly 54-58% of the total in nearly every decade; the exceptions in the table below are the 1910s (53.1%) and the 1920s (49.3%). The corpus is balanced across decades for sub-genres and domains as well (e.g. by Library of Congress classification for non-fiction, and by sub-genre for fiction: prose, poetry, drama, etc.). This balance across genres and sub-genres allows researchers to examine changes over time and be reasonably certain that the data reflect actual changes in the "real world", rather than being artifacts of a changing genre balance.

Download all 115,000 texts, for use on your own computer.

DECADE	TV/MOVIES	FICTION	POPULAR MAGAZINES	NEWSPAPERS	NON-FICTION BOOKS	TOTAL	% FICTION	% FICTION + TV/MOVIES
1820s		3,778,554	1,730,991		1,471,844	6,981,389	54.1%	54.1%
1830s		7,492,464	3,158,784		3,060,039	13,711,287	54.6%	54.6%
1840s		8,615,569	3,584,996		3,606,482	15,807,047	54.5%	54.5%
1850s		9,175,764	4,243,240		3,116,999	16,536,003	55.5%	55.5%
1860s		9,279,356	4,469,061	264,549	2,923,594	16,936,560	54.8%	54.8%
1870s		10,454,445	4,489,558	1,037,506	2,806,958	18,788,467	55.6%	55.6%
1880s		11,204,077	4,489,787	1,368,195	3,005,146	20,067,205	55.8%	55.8%
1890s		11,261,720	4,716,870	1,392,895	3,055,298	20,426,783	55.1%	55.1%
1900s		12,096,794	5,092,923	1,464,317	3,323,216	21,977,250	55.0%	55.0%
1910s		12,266,683	5,727,265	1,548,986	3,560,164	23,103,098	53.1%	53.1%
1920s		12,668,146	6,132,368	3,642,022	3,257,886	25,700,422	49.3%	49.3%
1930s	2,994,552	11,959,731	6,048,788	3,651,557	3,053,251	27,707,879	43.2%	54.0%
1940s	2,888,109	12,030,426	5,841,711	3,567,355	3,072,149	27,399,750	43.9%	54.4%
1950s	4,003,700	12,014,411	5,961,487	3,570,759	3,110,917	28,661,274	41.9%	55.9%
1960s	4,991,777	11,652,761	5,884,357	3,434,402	3,159,379	29,122,676	40.0%	57.2%
1970s	4,982,750	11,652,921	5,767,353	3,402,783	3,023,418	28,829,225	40.4%	57.7%
1980s	4,995,510	11,664,130	5,980,459	4,125,888	3,085,593	29,851,580	39.1%	55.8%
1990s	5,039,323	13,337,688	7,552,995	4,069,742	3,149,570	33,149,318	40.2%	55.4%
2000s	5,078,058	14,624,639	7,813,604	4,116,506	3,189,005	34,821,812	42.0%	56.6%
2010s	5,061,478	15,150,555	7,552,354	4,546,307	3,142,112	35,452,806	42.7%	57.0%
TOTAL	40,035,257	222,380,834	106,238,951	45,203,769	61,173,020	475,031,831
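
The two percentage columns can be recomputed from the word counts in each row. The following is a minimal sketch in Python, not part of the corpus tools: the names ROWS and genre_shares are hypothetical, only three rows are copied in for illustration, and blank cells in the table are assumed to mean zero words for that genre in that decade.

```python
# Word counts copied from three rows of the table above.
# Column order: TV/Movies, Fiction, Popular magazines, Newspapers, Non-fiction books.
# Assumption: a blank cell in the table means 0 words for that genre.
ROWS = {
    "1820s": (0, 3_778_554, 1_730_991, 0, 1_471_844),
    "1920s": (0, 12_668_146, 6_132_368, 3_642_022, 3_257_886),
    "1930s": (2_994_552, 11_959_731, 6_048_788, 3_651_557, 3_053_251),
}


def genre_shares(counts):
    """Return (total words, % fiction, % fiction + TV/Movies) for one decade."""
    tv, fiction = counts[0], counts[1]
    total = sum(counts)
    pct_fiction = round(100 * fiction / total, 1)
    pct_fiction_tv = round(100 * (fiction + tv) / total, 1)
    return total, pct_fiction, pct_fiction_tv


for decade, counts in ROWS.items():
    total, pct_fic, pct_fic_tv = genre_shares(counts)
    print(f"{decade}: total={total:,}  fiction={pct_fic}%  fiction+TV/Movies={pct_fic_tv}%")
```

Before the 1930s there are no TV/Movies texts, so the two percentages are identical; from the 1930s on, the combined figure is the one that stays close to the 54-57% range.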

The following are the changes that were made in the 2021 update:

1. A subset of the texts from the Movies and TV corpora was added to the corpus, to provide access to much more informal language.

2. Texts from 2010-2019 were added, to bring the corpus closer to the present.

3. The texts from 1810-1819 were dropped. These had fairly poor genre balance and amounted to only about one million words. This loss is more than offset by the 35+ million words of new texts from 2010-2019.

4. The textual corpus was cleaned and corrected by researchers from the University of Stuttgart.

5. Corrections were made to the metadata for a number of texts, based on feedback from users. For example, a text that originally had a date of 1905 but was actually from the 1950s now has the correct date. This should reduce "false hits" for words and phrases that appeared to come from an earlier period but did not.

6. Many duplicate texts were removed.