Download complete list of all 75,804 texts, with metadata

The TV Corpus is composed of about 325 million words in 75,804 texts from the 1950s to the 2010s (the last texts are from 2017). The following table shows the number of words by country and decade. (Note that MISC means that the first country listed in IMDB was not one of the six shown below, although in most cases one of these countries is listed as an "additional country".)

DECADE       US / CA       UK / IE       AU / NZ          MISC         TOTAL
1950s      2,012,631        20,740             -             -     2,033,371
1960s      6,728,110     2,168,841             -         5,727     8,902,678
1970s      5,717,836     3,063,468             -             -     8,781,304
1980s     11,905,793     3,054,673        49,263         1,814    15,011,543
1990s     26,825,820     4,373,746        78,769       228,645    31,506,980
2000s     71,570,270    14,511,570       997,291       464,778    87,543,909
2010s    141,039,715    25,959,596     4,015,203     1,406,977   172,421,491
TOTAL    265,800,175    53,152,634     5,140,526     2,107,941   326,201,276
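The column totals, and each country group's share of the corpus, can be re-derived from the per-decade rows. The following is a minimal sketch in Python, not part of the corpus tooling: the names `WORDS`, `REGIONS`, `region_totals`, and `region_shares` are hypothetical, and cells shown as "-" in the table are assumed to mean zero words.

```python
# Word counts per decade, copied from the table above.
# Column order: US/CA, UK/IE, AU/NZ, MISC.
# Assumption: a "-" cell in the table is treated as 0.
WORDS = {
    "1950s": (2_012_631, 20_740, 0, 0),
    "1960s": (6_728_110, 2_168_841, 0, 5_727),
    "1970s": (5_717_836, 3_063_468, 0, 0),
    "1980s": (11_905_793, 3_054_673, 49_263, 1_814),
    "1990s": (26_825_820, 4_373_746, 78_769, 228_645),
    "2000s": (71_570_270, 14_511_570, 997_291, 464_778),
    "2010s": (141_039_715, 25_959_596, 4_015_203, 1_406_977),
}
REGIONS = ("US/CA", "UK/IE", "AU/NZ", "MISC")


def region_totals(words):
    """Sum each country column over all decades."""
    return {
        region: sum(row[i] for row in words.values())
        for i, region in enumerate(REGIONS)
    }


def region_shares(words):
    """Return each country group's fraction of the whole corpus."""
    totals = region_totals(words)
    grand_total = sum(totals.values())
    return {region: count / grand_total for region, count in totals.items()}


if __name__ == "__main__":
    totals = region_totals(WORDS)
    for region, share in region_shares(WORDS).items():
        # The summed counts should match the TOTAL row of the table.
        print(f"{region}: {totals[region]:,} words ({share:.1%})")
```
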

The texts were taken from the OpenSubtitles collection. Where there were multiple subtitle files for a given TV episode (which was the norm), we used the "highest ranked" file in terms of accuracy, based on the ratings at OpenSubtitles. We then matched each episode (and series) with the corresponding page from IMDB, which provides rich metadata for these TV shows (and which can be used to create your own Virtual Corpus).
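The selection step described above (one file per episode, chosen by rating, then joined to metadata) can be illustrated roughly as follows. This is only a sketch under stated assumptions and not the code actually used to build the corpus: the `SubtitleFile` record, its fields, the episode identifiers, and the functions `pick_highest_ranked` and `attach_metadata` are all hypothetical, and the rating is assumed to be a single number where higher means more accurate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleFile:
    episode_id: str  # hypothetical key identifying one TV episode
    file_name: str   # hypothetical name of the subtitle file
    rating: float    # assumed numeric rating; higher is better


def pick_highest_ranked(files):
    """Keep one subtitle file per episode: the one with the highest rating.

    On a tie, the file seen first is kept.
    """
    best = {}
    for f in files:
        current = best.get(f.episode_id)
        if current is None or f.rating > current.rating:
            best[f.episode_id] = f
    return best


def attach_metadata(best, metadata_by_episode):
    """Pair each chosen file with its metadata record (None if not found)."""
    return {
        episode_id: (f, metadata_by_episode.get(episode_id))
        for episode_id, f in best.items()
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    candidates = [
        SubtitleFile("ep-001", "a.srt", 7.5),
        SubtitleFile("ep-001", "b.srt", 9.0),
        SubtitleFile("ep-002", "c.srt", 6.0),
    ]
    chosen = pick_highest_ranked(candidates)
    for episode_id, f in chosen.items():
        print(episode_id, f.file_name, f.rating)
```

Keeping the result keyed by episode makes the later join with per-episode metadata a simple dictionary lookup.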