Download detailed list of all 140,703 texts

The corpus is composed of more than 475 million words of text in more than 100,000 individual texts. The major sources for each genre are as follows:

TV/Movies: Open Subtitles
Fiction: Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010)
Magazine: Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010)
- In each decade, the magazines are balanced across at least ten magazines (with equivalent sub-genres for the 1900s)
Newspaper: historical newspaper archives, converted from PDF to TXT (1850-1980), COCA etc. (1990-2010)
Non-fiction: Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010)
- In each decade, the non-fiction is balanced across the Library of Congress classification system

The corpus is balanced by genre across the decades. For example, fiction + TV/Movies (which are scripted, and so similar to fiction) accounts for roughly 53-58% of the total in nearly every decade from the 1820s onward; the 1920s, at 49.3%, is the one exception. The corpus is also balanced across decades for sub-genres and domains (e.g. by Library of Congress classification for non-fiction, and by sub-genre for fiction: prose, poetry, drama, etc.). This balance across genres and sub-genres allows researchers to examine changes over time and be reasonably certain that the data reflect actual changes in the "real world", rather than artifacts of a changing genre balance.

Download all 115,000 texts, for use on your own computer.

| DECADE | TV/MOVIES | FICTION | POPULAR MAGAZINES | NEWSPAPERS | NON-FICTION BOOKS | TOTAL | % FICTION | % FIC (TV/MOV) |
|---|---|---|---|---|---|---|---|---|
| 1810s | | 641,164 | 88,316 | | 451,542 | 1,181,022 | | |
| 1820s | | 3,778,554 | 1,730,991 | | 1,471,844 | 6,981,389 | 54.1% | 54.1% |
| 1830s | | 7,492,464 | 3,158,784 | | 3,060,039 | 13,711,287 | 54.6% | 54.6% |
| 1840s | | 8,615,569 | 3,584,996 | | 3,606,482 | 15,807,047 | 54.5% | 54.5% |
| 1850s | | 9,175,764 | 4,243,240 | | 3,116,999 | 16,536,003 | 55.5% | 55.5% |
| 1860s | | 9,279,356 | 4,469,061 | 264,549 | 2,923,594 | 16,936,560 | 54.8% | 54.8% |
| 1870s | | 10,454,445 | 4,489,558 | 1,037,506 | 2,806,958 | 18,788,467 | 55.6% | 55.6% |
| 1880s | | 11,204,077 | 4,489,787 | 1,368,195 | 3,005,146 | 20,067,205 | 55.8% | 55.8% |
| 1890s | | 11,261,720 | 4,716,870 | 1,392,895 | 3,055,298 | 20,426,783 | 55.1% | 55.1% |
| 1900s | | 12,096,794 | 5,092,923 | 1,464,317 | 3,323,216 | 21,977,250 | 55.0% | 55.0% |
| 1910s | | 12,266,683 | 5,727,265 | 1,548,986 | 3,560,164 | 23,103,098 | 53.1% | 53.1% |
| 1920s | | 12,668,146 | 6,132,368 | 3,642,022 | 3,257,886 | 25,700,422 | 49.3% | 49.3% |
| 1930s | 2,994,552 | 11,959,731 | 6,048,788 | 3,651,557 | 3,053,251 | 27,707,879 | 43.2% | 54.0% |
| 1940s | 2,888,109 | 12,030,426 | 5,841,711 | 3,567,355 | 3,072,149 | 27,399,750 | 43.9% | 54.4% |
| 1950s | 4,003,700 | 12,014,411 | 5,961,487 | 3,570,759 | 3,110,917 | 28,661,274 | 41.9% | 55.9% |
| 1960s | 4,991,777 | 11,652,761 | 5,884,357 | 3,434,402 | 3,159,379 | 29,122,676 | 40.0% | 57.2% |
| 1970s | 4,982,750 | 11,652,921 | 5,767,353 | 3,402,783 | 3,023,418 | 28,829,225 | 40.4% | 57.7% |
| 1980s | 4,995,510 | 11,664,130 | 5,980,459 | 4,125,888 | 3,085,593 | 29,851,580 | 39.1% | 55.8% |
| 1990s | 5,039,323 | 13,337,688 | 7,552,995 | 4,069,742 | 3,149,570 | 33,149,318 | 40.2% | 55.4% |
| 2000s | 5,078,058 | 14,624,639 | 7,813,604 | 4,116,506 | 3,189,005 | 34,821,812 | 42.0% | 56.6% |
| 2010s | 5,061,478 | 15,150,555 | 7,552,354 | 4,546,307 | 3,142,112 | 35,452,806 | 42.7% | 57.0% |
| TOTAL | 40,035,257 | 222,380,834 | 106,238,951 | 45,203,769 | 61,173,020 | 475,031,831 | | |
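
The two percentage columns can be reproduced directly from the word counts: "% FICTION" is fiction divided by the decade total, and "% FIC (TV/MOV)" adds the TV/Movies words to the numerator. The following is a minimal sketch in Python using three rows copied from the table; the names `COUNTS` and `fiction_shares` are illustrative only and are not part of any corpus tooling.

```python
# Word counts copied from three rows of the table: (tv_movies, fiction, total).
# Decades before the 1930s have no TV/Movies texts, so that count is 0.
COUNTS = {
    "1820s": (0, 3_778_554, 6_981_389),
    "1920s": (0, 12_668_146, 25_700_422),
    "1970s": (4_982_750, 11_652_921, 28_829_225),
}


def fiction_shares(tv_movies, fiction, total):
    """Return (% fiction, % fiction + TV/Movies), each rounded to one decimal."""
    pct_fiction = round(100 * fiction / total, 1)
    pct_with_tv = round(100 * (fiction + tv_movies) / total, 1)
    return pct_fiction, pct_with_tv


for decade, row in COUNTS.items():
    pct_fiction, pct_with_tv = fiction_shares(*row)
    print(f"{decade}: {pct_fiction}% fiction, {pct_with_tv}% fiction + TV/Movies")
```

For the 1970s row this gives 40.4% and 57.7%, matching the table, which shows how the TV/Movies texts keep the scripted-plus-fiction share near its earlier level even though fiction alone drops to about 40%.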

The following are the changes that were made in the 2021 update:

1. A subset of the texts from the Movies and TV corpora was added to the corpus, to provide access to much more informal language.

2. Texts from 2010-2019 were added, to bring the corpus closer to the present.

3. The texts from 1810-1819 were dropped. These had fairly poor genre balance and amounted to only about one million words of data. The loss is more than made up for by the 35+ million words of new texts from 2010-2019.

4. The textual corpus was cleaned and corrected by researchers from the University of Stuttgart.

5. Corrections were made to the metadata for a number of texts, based on feedback from users. For example, a text that originally had a date of 1905 but was actually from the 1950s now has the correct date. This should prevent more "false hits" for words and phrases that seemed to be from an earlier period, but were not.

6. Many duplicate texts were removed.