Download detailed list of all 140,703 texts
The corpus is composed of more than 475 million words of text in more than 100,000 individual texts. The major sources for each genre are as follows:
| Genre | Major sources |
| --- | --- |
| TV/Movies | From Open Subtitles |
| Fiction | Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010) |
| Magazine | Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010). In each decade, the magazines are balanced across at least ten magazines (with equivalent sub-genres for the 1900s). |
| Newspaper | PDF > TXT from historical archives of newspapers (1850-1980), COCA etc. (1990-2010) |
| Non-fiction | Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010). In each decade, the non-fiction is balanced across the Library of Congress classification system. |
The corpus is balanced by genre across the decades. For example, fiction plus TV/Movies (which are scripted, and so similar to fiction) accounts for roughly 54-57% of the total in most decades (1820s-2000s), and the corpus is balanced across decades for sub-genres and domains as well (e.g. by Library of Congress classification for non-fiction, and by sub-genre for fiction: prose, poetry, drama, etc.). This balance across genres and sub-genres allows researchers to examine changes and be reasonably certain that the data reflect actual changes in the "real world", rather than just being artifacts of a changing genre balance.
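Because the decades differ greatly in size, comparing a word across decades usually means normalizing its raw count by the decade's total word count. The following is a minimal sketch of that arithmetic, not a description of how the corpus interface computes its figures. The decade totals are copied from the TOTAL column of the table on this page; the raw counts and the helper name `per_million` are hypothetical and chosen only for illustration.

```python
# Decade totals copied from the TOTAL column of the table on this page.
DECADE_TOTALS = {
    "1820s": 6_981_389,
    "2000s": 34_821_812,
}


def per_million(raw_count: int, decade: str) -> float:
    """Return occurrences per million words for the given decade."""
    return raw_count / DECADE_TOTALS[decade] * 1_000_000


# Hypothetical raw counts for some word (illustrative values, not corpus data).
raw_counts = {"1820s": 70, "2000s": 350}

normalized = {decade: per_million(count, decade)
              for decade, count in raw_counts.items()}

for decade, value in normalized.items():
    print(f"{decade}: {value:.2f} per million words")
```

In this made-up case the raw count is five times higher in the 2000s, yet the normalized rates are nearly identical, because the 2000s sample is about five times larger.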
Download all 115,000 texts, for use on your own computer.
| DECADE | TV/MOVIES | FICTION | POPULAR MAGAZINES | NEWSPAPERS | NON-FICTION BOOKS | TOTAL | % FICTION | % FICTION + TV/MOVIES |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1820s | | 3,778,554 | 1,730,991 | | 1,471,844 | 6,981,389 | 54.1% | 54.1% |
| 1830s | | 7,492,464 | 3,158,784 | | 3,060,039 | 13,711,287 | 54.6% | 54.6% |
| 1840s | | 8,615,569 | 3,584,996 | | 3,606,482 | 15,807,047 | 54.5% | 54.5% |
| 1850s | | 9,175,764 | 4,243,240 | | 3,116,999 | 16,536,003 | 55.5% | 55.5% |
| 1860s | | 9,279,356 | 4,469,061 | 264,549 | 2,923,594 | 16,936,560 | 54.8% | 54.8% |
| 1870s | | 10,454,445 | 4,489,558 | 1,037,506 | 2,806,958 | 18,788,467 | 55.6% | 55.6% |
| 1880s | | 11,204,077 | 4,489,787 | 1,368,195 | 3,005,146 | 20,067,205 | 55.8% | 55.8% |
| 1890s | | 11,261,720 | 4,716,870 | 1,392,895 | 3,055,298 | 20,426,783 | 55.1% | 55.1% |
| 1900s | | 12,096,794 | 5,092,923 | 1,464,317 | 3,323,216 | 21,977,250 | 55.0% | 55.0% |
| 1910s | | 12,266,683 | 5,727,265 | 1,548,986 | 3,560,164 | 23,103,098 | 53.1% | 53.1% |
| 1920s | | 12,668,146 | 6,132,368 | 3,642,022 | 3,257,886 | 25,700,422 | 49.3% | 49.3% |
| 1930s | 2,994,552 | 11,959,731 | 6,048,788 | 3,651,557 | 3,053,251 | 27,707,879 | 43.2% | 54.0% |
| 1940s | 2,888,109 | 12,030,426 | 5,841,711 | 3,567,355 | 3,072,149 | 27,399,750 | 43.9% | 54.4% |
| 1950s | 4,003,700 | 12,014,411 | 5,961,487 | 3,570,759 | 3,110,917 | 28,661,274 | 41.9% | 55.9% |
| 1960s | 4,991,777 | 11,652,761 | 5,884,357 | 3,434,402 | 3,159,379 | 29,122,676 | 40.0% | 57.2% |
| 1970s | 4,982,750 | 11,652,921 | 5,767,353 | 3,402,783 | 3,023,418 | 28,829,225 | 40.4% | 57.7% |
| 1980s | 4,995,510 | 11,664,130 | 5,980,459 | 4,125,888 | 3,085,593 | 29,851,580 | 39.1% | 55.8% |
| 1990s | 5,039,323 | 13,337,688 | 7,552,995 | 4,069,742 | 3,149,570 | 33,149,318 | 40.2% | 55.4% |
| 2000s | 5,078,058 | 14,624,639 | 7,813,604 | 4,116,506 | 3,189,005 | 34,821,812 | 42.0% | 56.6% |
| 2010s | 5,061,478 | 15,150,555 | 7,552,354 | 4,546,307 | 3,142,112 | 35,452,806 | 42.7% | 57.0% |
| TOTAL | 40,035,257 | 222,380,834 | 106,238,951 | 45,203,769 | 61,173,020 | 475,031,831 | | |
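The two percentage columns can be reproduced from the word counts in a row. The sketch below assumes that each percentage is the genre's word count divided by the decade TOTAL, rounded to one decimal place; that assumption matches the 1930s row used here, but the page does not state the formula explicitly. The dictionary keys and the helper name `pct` are illustrative only.

```python
# Word counts for the 1930s, copied from the table on this page.
row_1930s = {
    "tv_movies": 2_994_552,
    "fiction": 11_959_731,
    "magazines": 6_048_788,
    "newspapers": 3_651_557,
    "non_fiction": 3_053_251,
}

# The TOTAL column is the sum of the five genre columns.
total = sum(row_1930s.values())


def pct(part: int, whole: int) -> float:
    """Share of `whole` taken up by `part`, as a percentage with one decimal."""
    return round(100 * part / whole, 1)


# Fiction alone, as a share of the decade total.
pct_fiction = pct(row_1930s["fiction"], total)

# Fiction plus TV/Movies, as a share of the decade total.
pct_fiction_tv = pct(row_1930s["fiction"] + row_1930s["tv_movies"], total)

print(total, pct_fiction, pct_fiction_tv)
```

For decades before the 1930s there are no TV/Movies texts, so the two percentages coincide.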
The following changes were made in the 2021 update:
1. A subset of the texts from the Movies and TV corpora was added to the corpus, to provide access to much more informal language.
2. Texts from 2010-2019 were added, to bring the corpus closer to the present.
3. The texts from 1810-1819 were dropped. These had fairly poor genre balance and there were only about one million words of data. This loss is more than offset by the 35+ million words from the recent texts from 2010-2019.
4. The textual corpus was cleaned and corrected by researchers from the University of Stuttgart.
5. Corrections were made to the metadata for a number of texts, based on feedback from users. For example, a text that originally had a date of 1905 but was actually from the 1950s now has the correct date. This should prevent further "false hits" for words and phrases that seemed to be from an earlier period, but were not.
6. Many duplicate texts were removed.