Download complete list of all 75,804 texts, with metadata
The TV Corpus is composed of 325 million words in
75,804 texts from the 1950s to the 2010s (the last texts are from 2017).
The following table shows the number of words by country and
decade. (Note that MISC means that the first country listed in IMDB was
not one of the six shown below, although in most cases one of these
countries is listed as an "additional country".)
The texts were taken from the OpenSubtitles collection. In cases where there were multiple subtitle files for a given TV episode (which was the norm), we used the "highest ranked" file in terms of accuracy, based on the ratings at OpenSubtitles. We then matched each episode (and series) with the corresponding page from IMDB, which provides rich metadata for these TV shows and which can be used to create your own Virtual Corpus.
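
The selection step described above (keeping one subtitle file per episode, the one with the highest rating) can be sketched roughly as follows. This is only an illustration, not the code used to build the corpus: the `SubtitleFile` record, its field names, the sample identifiers, and the tie-breaking rule (the first file seen wins) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleFile:
    # All fields are hypothetical; the real OpenSubtitles data may differ.
    episode_id: str   # identifier shared by all files for one episode
    file_name: str    # name of the subtitle file
    rating: float     # accuracy rating; higher is assumed to be better


def pick_highest_rated(files):
    """Return a dict mapping each episode_id to its highest-rated file.

    On a tie, the file encountered first is kept (an arbitrary choice).
    """
    best = {}
    for f in files:
        current = best.get(f.episode_id)
        if current is None or f.rating > current.rating:
            best[f.episode_id] = f
    return best


# Made-up sample data: two episodes, three candidate files.
candidates = [
    SubtitleFile("ep-001", "a.srt", 6.5),
    SubtitleFile("ep-001", "b.srt", 9.0),
    SubtitleFile("ep-002", "c.srt", 7.2),
]

selected = pick_highest_rated(candidates)
for episode_id, chosen in selected.items():
    print(episode_id, chosen.file_name, chosen.rating)
```

Once a single file has been chosen per episode, the episode identifier is what would be used to look up the matching IMDB page and attach its metadata.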