Most corpora aim to capture the informal, more "spoken" variety of a language, as opposed to (or at least in addition to) the more formal language of fiction, newspapers, magazines, or academic writing. This is hard to do, however, because creating a large corpus of the spoken language is time-consuming and expensive: the texts have to be recorded, transcribed, and then annotated.

As a result, spoken corpora tend to be quite small. For English, for example, the MICASE, CALLHOME, and CALLFRIEND corpora are all between about 1 and 2 million words. This might be adequate for extremely high-frequency phenomena (e.g. modals and other auxiliary verbs), but it is far too small for a careful look at medium- and lower-frequency words and phrases.
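To make the size problem concrete, here is a rough back-of-the-envelope sketch in Python. The per-million-word rates below are invented for illustration (they are not measured values); the corpus sizes are the ones mentioned in this section.

```python
# Why 1-2 million words is too small for medium- and low-frequency
# items: the expected token count is just rate (per million words)
# multiplied by corpus size (in millions of words).

corpus_sizes_mw = {
    "MICASE-scale corpus": 2,    # roughly 1-2 million words
    "BNC conversation": 10,      # with the 2014 update
    "COCA spoken": 110,
}

# Hypothetical per-million-word rates (illustrative only).
rates_pmw = {
    "high-frequency modal": 2500,
    "medium-frequency word": 2,
    "low-frequency word": 0.2,
}

for corpus, size_mw in corpus_sizes_mw.items():
    for item, rate in rates_pmw.items():
        expected = rate * size_mw
        print(f"{corpus:>22}  {item:<22} ~{expected:g} expected tokens")
```

At two occurrences per million words, a 2-million-word corpus yields only about four tokens of a word -- far too few to say anything useful about its collocates or its variation across speakers and contexts.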

The British National Corpus (our interface) is perhaps the only corpus with a large amount of everyday conversation -- about 5 million words of text (plus 5 million more in the partially available 2014 update). But the BNC is almost a "one-off": it drew on large institutional funding (e.g. millions of dollars from Oxford University Press) and a large corpus creation team, resources that most corpus projects cannot tap into. In addition, even though the conversational portion of the BNC is now 10 million words (with the 2014 update), that is still only about 1/33 the size of the TV Corpus.

The Corpus of Contemporary American English (COCA) is much larger and more recent than these other corpora. COCA contains 110+ million words of spoken English -- 5 million words each year from 1990 to the present. These transcripts come from unscripted conversation on TV and radio programs like Good Morning America, the Today Show, All Things Considered, and Oprah. Unfortunately, the conversations often don't deal with "everyday" topics; instead, they tend to deal with politics, the economy, science, business, entertainment personalities, and other current events.

Some researchers have hit upon an interesting approach. In projects like SUBTLEXus, rather than using transcriptions of actual recorded speech, they use data from the subtitles of movies and TV shows, on the theory that the dialogue in most movies and TV shows represents the spoken language quite well. (For examples of this research, see A, B, and C.) As this research shows, subtitle data agrees with native speakers' intuitions about their language even better than data from actual everyday conversation (such as the BNC).
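As a loose illustration of how a subtitle-based frequency list in the style of SUBTLEXus might be built, here is a minimal Python sketch that counts word frequencies across a directory of .srt subtitle files. The directory name and the crude tokenizer are placeholder assumptions, not part of SUBTLEXus's actual pipeline.

```python
# Minimal sketch of the subtitle-frequency idea: derive word counts
# from subtitle files rather than from transcribed recordings.
# Assumes a local directory of .srt files; the path and the simple
# tokenizer below are placeholders for illustration only.
import re
from collections import Counter
from pathlib import Path

TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}")  # SRT cue times

def tokens_from_srt(path: Path):
    for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = line.strip()
        # Skip cue numbers and timestamp lines; keep only the dialogue.
        if not line or line.isdigit() or TIMESTAMP.search(line):
            continue
        yield from re.findall(r"[a-z']+", line.lower())

counts = Counter()
for srt in Path("subtitles/").glob("*.srt"):
    counts.update(tokens_from_srt(srt))

total = sum(counts.values())
# Report frequency per million words, the unit such word lists use.
for word, n in counts.most_common(20):
    print(f"{word:<15} {n:>8}  {n / total * 1e6:10.1f} per million")
```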

The TV Corpus is based on data that is very similar to that of SUBTLEXus. (And if you're interested in this kind of scripted language, you might also consider the 100-million-word Soap Opera corpus.) The TV Corpus, however, allows you to do much more than just look up the frequency of a specific word in a word list. As with all of the BYU corpora, the TV Corpus allows you to do the following (illustrated in the sketch after this list):

  • search by word, lemma, or part of speech

  • see collocates and re-sortable concordance lines

  • create Virtual Corpora (e.g. all dramas from the US in 2000-2009 that were rated R)

  • compare the frequency of words over time and between dialects
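The queries above all run in the corpus web interface itself. Purely as an illustration of what they compute, here is a hedged Python/pandas sketch against a hypothetical local export of the corpus data (one row per token, with the metadata fields the interface exposes). Every file and column name here is invented for the example.

```python
# The TV Corpus web interface runs these queries for you; this
# sketch just illustrates the same ideas against a hypothetical
# local table of hits, one row per token. All file and column
# names are invented.
import pandas as pd

hits = pd.read_csv("tv_corpus_hits.csv")   # hypothetical export
hits["decade"] = hits["year"] // 10 * 10

# "Virtual corpus": all US dramas from 2000-2009 rated R
# (the example from the list above).
vc = hits[(hits["country"] == "US") & (hits["genre"] == "Drama")
          & hits["year"].between(2000, 2009) & (hits["rating"] == "R")]
print(f"virtual corpus size: {len(vc):,} tokens")

# Frequency of one word over time, normalized per million words.
word = "awesome"
size_by_decade = hits.groupby("decade").size()
word_by_decade = hits[hits["token"] == word].groupby("decade").size()
print((word_by_decade / size_by_decade * 1e6).round(1))

# The same word compared between dialects (here: by country).
size_by_country = hits.groupby("country").size()
word_by_country = hits[hits["token"] == word].groupby("country").size()
print((word_by_country / size_by_country * 1e6).round(1))
```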

In summary, the value of the TV Corpus for studying very informal language is supported by the previous research mentioned above, as well as by more anecdotal comparisons of words and phrases against the spoken portion of the BNC. In addition, the TV Corpus provides a full corpus, rather than just a word list (as with SUBTLEXus). And finally, the TV Corpus is much larger than small spoken corpora like the BNC -- 20x-40x as much data -- and yet just as informal.