Download complete list of all 25,094 texts, with metadata

The Movies Corpus contains about 200 million words in 25,094 texts from the 1930s to the 2010s (the most recent texts are from 2018). The following table shows the number of words by country and decade. (Note that Misc means that the first country listed in IMDB was not one of the six shown below, although in most cases one of these countries is listed as an "additional country".)
 
Decade       US / CA      UK / IE      AU / NZ         Misc        TOTAL
1930s      6,013,722      445,980        2,245      104,255    6,566,202
1940s      8,679,722    1,077,429          ---       51,151    9,808,302
1950s      8,570,819    1,826,174       21,777      197,173   10,615,943
1960s      5,851,067    2,687,175        6,594      557,976    9,102,812
1970s      6,972,688    2,060,309      112,715      958,968   10,104,680
1980s     10,739,129    2,153,349      308,640      917,461   14,118,579
1990s     19,259,078    2,983,322      384,607    1,986,577   24,613,584
2000s     38,572,824    6,970,252      793,610    4,893,749   51,230,435
2010s     48,649,187    8,705,479    1,337,876    4,626,223   63,318,765
TOTAL    153,308,236   28,909,469    2,968,064   14,293,533  199,479,302
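
To see how the corpus is distributed across the country groups, the TOTAL row of the table can be turned into percentage shares. The following is a minimal Python sketch; the variable names are illustrative, and the figures are copied from the table above.

```python
# Word counts per country group, copied from the TOTAL row of the table.
totals = {
    "US / CA": 153_308_236,
    "UK / IE": 28_909_469,
    "AU / NZ": 2_968_064,
    "Misc": 14_293_533,
}
grand_total = 199_479_302  # bottom-right cell of the table

# Sanity check: the four column totals should add up to the grand total.
assert sum(totals.values()) == grand_total

# Share of the whole corpus contributed by each country group.
shares = {region: count / grand_total for region, count in totals.items()}

for region, share in shares.items():
    print(f"{region:8s} {share:6.1%}")
```

The same calculation can be repeated for any single decade by substituting that row's figures.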


The texts were taken from the OpenSubtitles collection. Where there were multiple subtitle files for a given movie (which was the norm), we used the "highest ranked" file in terms of accuracy, based on the ratings at OpenSubtitles. We then matched each movie with its corresponding page at IMDB, which provides rich metadata for each movie (and which can be used to create your own Virtual Corpus).
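
The selection and matching steps described above can be sketched in Python as follows. This is only an illustration, not the code used to build the corpus: the `SubtitleFile` record, its field names, the sample data, and the rule that the first file seen wins a tie are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SubtitleFile:
    imdb_id: str   # hypothetical: identifier of the movie at IMDB
    file_id: str   # hypothetical: identifier of the subtitle file
    rating: float  # hypothetical: accuracy rating from OpenSubtitles


def pick_best_per_movie(files):
    """Keep only the highest-rated subtitle file for each movie.

    On a tie, the file encountered first is kept (an assumption).
    """
    best = {}
    for f in files:
        current = best.get(f.imdb_id)
        if current is None or f.rating > current.rating:
            best[f.imdb_id] = f
    return best


def attach_metadata(best, imdb_metadata):
    """Pair each selected file with its movie's metadata.

    Movies with no metadata entry are skipped.
    """
    return {
        imdb_id: (f, imdb_metadata[imdb_id])
        for imdb_id, f in best.items()
        if imdb_id in imdb_metadata
    }


# Made-up sample data for illustration only.
files = [
    SubtitleFile("tt0000001", "a", 6.5),
    SubtitleFile("tt0000001", "b", 9.0),
    SubtitleFile("tt0000002", "c", 7.0),
]
imdb_metadata = {"tt0000001": {"year": 1950, "country": "US"}}

best = pick_best_per_movie(files)
matched = attach_metadata(best, imdb_metadata)
print(matched["tt0000001"][0].file_id)
```

Keying both steps on the movie identifier keeps the result at one text per movie, which is what allows the metadata to be used later for filtering.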