Most corpora aim to represent the informal, more "spoken" variety of a language, as opposed to (or at least in addition to) more formal fiction, newspapers, magazines, or academic writing. This is hard to do, however, because creating a large corpus of the spoken language is very time-consuming and expensive: the texts have to be recorded, transcribed, and then annotated.

As a result, spoken corpora tend to be quite small. For English, for example, the MICASE, CALLHOME and CALLFRIEND corpora are all between about 1 and 2 million words. This might be adequate for extremely high-frequency phenomena (e.g. modals and other auxiliary verbs), but it is far too small to look carefully at medium- and lower-frequency words.
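
The size problem can be made concrete with a little arithmetic. The sketch below estimates how many tokens of a word you would expect to find in corpora of different sizes, given the word's frequency per million words. The frequency bands (1,000, 10, and 0.5 per million) are illustrative assumptions rather than measurements, and `expected_hits` is a hypothetical helper name.

```python
def expected_hits(per_million: float, corpus_size_words: int) -> float:
    """Expected number of tokens of a word, given its frequency per million words."""
    return per_million * corpus_size_words / 1_000_000


# Corpus sizes taken from the figures mentioned in the text.
corpus_sizes = {
    "small spoken corpus (~1M words)": 1_000_000,
    "small spoken corpus (~2M words)": 2_000_000,
    "BNC conversation incl. 2014 update (~10M words)": 10_000_000,
}

# Assumed, illustrative frequency bands (tokens per million words).
bands = {"high": 1000.0, "medium": 10.0, "low": 0.5}

for label, size in corpus_sizes.items():
    for band, per_million in bands.items():
        hits = expected_hits(per_million, size)
        print(f"{label:50s} {band:6s} {hits:10.1f}")
```

Under these assumptions, a low-frequency word would be expected to occur only about once in a 2-million-word corpus, which is too little data to say anything about how it is used.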

The British National Corpus (BNC interface) is perhaps the only corpus that has a large amount of everyday conversation -- about 5 million words of text (plus 5 million more in the partially available 2014 update). But the BNC is almost a "one-off" type of corpus, since large institutional funding (e.g. millions of dollars from Oxford University Press) and staffing (a large corpus creation team) aren't things that most corpora can tap into. In addition, even though the conversational portion of the BNC is now 10 million words (with the 2014 update), that is still only about one-twelfth the size of the TV/Movies data in COCA.

Some researchers have hit upon an interesting approach. In projects like SUBTLEXus, rather than using transcriptions of actual recorded speech, they use data from subtitles of movies and TV, on the theory that the dialogue in most TV shows and movies represents the spoken language very well. (For examples of this research, see A, B, and C.) As this research shows, the data from subtitles agrees with native speaker intuitions about their language even better than the data from actual everyday conversation (like that in the BNC).

Our TV and Movies data is based on texts that are very similar to those used for SUBTLEXus. The TV and Movies corpora, however, allow you to do more than just look up the frequency of a specific word in a word list. As with all of our corpora, they allow you to:

  • search by word, lemma, or part of speech

  • see collocates and re-sortable concordance lines

  • create Virtual Corpora (e.g. all dramas from the US in 2000-2009 that were rated R)

  • compare the frequency of words over time and between dialects
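
To illustrate what collocates and re-sortable concordance lines are, here is a minimal sketch on a tiny invented text. It is not how the corpus interface is implemented; the sample sentences and the function names (`tokenize`, `concordance`, `collocates`) are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens: list[str], node: str, width: int = 3):
    """Return (left context, node word, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def collocates(tokens: list[str], node: str, span: int = 2) -> Counter:
    """Count the words occurring within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


# Invented sample text, standing in for subtitle dialogue.
sample = "I kind of want to go. You kind of have to go now. What kind of show is it?"
tokens = tokenize(sample)

# Concordance lines, re-sorted alphabetically by the words to the right of the node.
for left, node, right in sorted(concordance(tokens, "kind"), key=lambda line: line[2]):
    print(f"{' '.join(left):>20s}  [{node}]  {' '.join(right)}")

# Most frequent neighbors of the node word.
print(collocates(tokens, "kind", span=1).most_common(3))
```

A word list such as SUBTLEXus can only report how often "kind" occurs; concordance lines and collocates like these require access to the running text itself.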

In summary, the overall value of the TV and Movies corpora (in terms of very informal language) is justified by previous research (see above). In addition, the TV Corpus provides a full corpus, rather than just a word list (as with SUBTLEXus). And finally, the TV Corpus is much larger than small spoken corpora like the BNC -- much more data, and yet equally informal.