COCA: TV and Movie subtitles (informal language)
Some researchers have employed an interesting approach that does a very good job of "modeling" spoken language. In projects like SUBTLEXus, rather than using transcriptions of actual recorded speech, they use data from subtitles of movies and TV, on the idea that the dialogue in most TV shows and movies represents the spoken language really well. (For examples of this research, see A, B, and C.) As this research shows, the data from subtitles agrees with native speaker intuitions about their language even better than the data from actual everyday conversation (like that in the BNC).
The TV and Movies data (128 million words) in COCA is taken from the 1990s-2010s American portion of the TV (325 million words) and Movies (200 million words) corpora. This data is based on texts that are very similar to those in SUBTLEXus.
One might be suspicious of dialogue from TV shows and movies. After all, it is written by a scriptwriter. How well does it really represent authentic, "spoken" language? Let's take a look at this in some detail. In each case, we'll compare the TV and Movies data with the spoken portion of the BNC. We'll see that in most cases, the language in the TV shows and movies is actually much more informal than the BNC.
Phrases: TV/Movies vs BNC-Spoken
(For more details and even more carefully constructed data, see Davies 2021)
The following table shows the raw frequency (columns 3-5) and the frequency per million (PM) words (columns 6-8). The rightmost two columns show how much more frequent the phrase is in the TV and Movies corpora than in BNC-Spoken (e.g. 12.4 = about twelve times as frequent, per million words). You can click on any of the entries to see the actual examples from the three corpora. For the BNC, look at the SPOKEN column of the chart. For the movies, look at the ALL column at the left.
| Query | Example | TV | Movies | BNC-Spok | TV-PM | Movies-PM | BNC-S-PM | TV/BNC | Movies/BNC |
|---|---|---|---|---|---|---|---|---|---|
| . you VERB me ? | . You heard me? (=subject ellipsis) | 3,491 | 2,946 | 0 | 10.7 | 14.7 | 0.0 | 107.4 | 147.3 |
| , ok\|okay ? | we're leaving now, OK? | 100,866 | 59,288 | 344 | 310.4 | 296.4 | 34.8 | 8.9 | 8.5 |
| , right ? | you're pretty tired, right? | 111,195 | 59,081 | 274 | 342.1 | 295.4 | 27.7 | 12.4 | 10.7 |
| I told you | I told you to get out of here | 45,899 | 31,302 | 385 | 141.2 | 156.5 | 38.7 | 3.6 | 4.0 |
| DO n't get it | I don't get it -- why do you hate me so much? | 9,188 | 4,847 | 89 | 28.3 | 24.2 | 9.0 | 3.1 | 2.7 |
| how can you | How can you even say that? | 10,155 | 7,331 | 193 | 31.2 | 36.7 | 19.5 | 1.6 | 1.9 |
| my God | My God -- she's horrible! | 102,515 | 57,812 | 572 | 315.4 | 289.1 | 20.0 | 15.8 | 14.5 |
| . it 's ADJ . | . It's sad. She's totally forgotten him. (=short phrases) | 56,198 | 36,161 | 126 | 172.9 | 180.8 | 34.3 | 5.0 | 5.3 |
| Situational (shows that the movie scripts are very oriented to the "here and now") | | | | | | | | | |
| hand me * NOUN | Hand me a towel. | 1,641 | 1,107 | 2 | 5.0 | 5.5 | 0.2 | 25.2 | 27.7 |
| . Get out | . Get out before I call the police! | 11,263 | 10,374 | 23 | 34.7 | 51.9 | 2.7 | 12.8 | 19.2 |
| do n't leave | Don't leave! I need you! | 4,890 | 4,667 | 39 | 15.0 | 23.3 | 0.7 | 21.5 | 33.3 |
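To make the arithmetic behind the per-million and ratio columns concrete, here is a minimal sketch in Python that recomputes the ", right ?" row. The TV and Movies corpus sizes are the ones given in the text above. The BNC-Spoken size (about 9.9 million words) is an assumption inferred from the table, not a figure stated on this page. The 0.1 floor used when the BNC frequency is zero is also an assumption: it happens to reproduce the ratios in the first row, but the page does not say how that row was calculated. The names `CORPUS_SIZES`, `per_million`, and `ratio` are hypothetical and chosen only for illustration.

```python
# Corpus sizes in words. TV and Movies come from the text above.
# ASSUMPTION: BNC-Spoken is taken as roughly 9.9 million words (inferred from the table).
CORPUS_SIZES = {
    "TV": 325_000_000,
    "Movies": 200_000_000,
    "BNC-Spok": 9_900_000,
}


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw count to a frequency per million words."""
    return raw_count / corpus_size * 1_000_000


def ratio(pm_a: float, pm_b: float, floor: float = 0.1) -> float:
    """How many times more frequent A is than B, per million words.

    ASSUMPTION: the denominator is floored at 0.1 to avoid dividing by zero.
    """
    return pm_a / max(pm_b, floor)


# Raw counts for the query ", right ?" as listed in the table.
raw = {"TV": 111_195, "Movies": 59_081, "BNC-Spok": 274}

pm = {name: per_million(count, CORPUS_SIZES[name]) for name, count in raw.items()}
tv_vs_bnc = ratio(pm["TV"], pm["BNC-Spok"])
movies_vs_bnc = ratio(pm["Movies"], pm["BNC-Spok"])

for name, value in pm.items():
    print(f"{name}-PM: {value:.1f}")
print(f"TV/BNC: {tv_vs_bnc:.1f}")
print(f"Movies/BNC: {movies_vs_bnc:.1f}")
```

Under these assumptions the output matches the ", right ?" row (342.1, 295.4, 27.7, 12.4, 10.7). Not every row can be recomputed from its raw BNC count in this way, so treat the sketch as an illustration of the normalization rather than a reconstruction of how the table was built.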
Syntax: TV/Movies vs BNC-Spoken
(For more details and even more carefully constructed data, see Davies 2021)
In many cases, the data from the TV shows and movies is more informal than the Spoken portion of the BNC (British National Corpus) in terms of syntax (grammar).
Features #1-3 below are informal features of English syntax (click on the BNC link to see evidence of this). In each case, these informal features are more common in the TV and Movies section of COCA than in the Spoken section of the BNC. Feature #4 (BE passive) is least common in informal, spoken English (see the BNC and COCA links), and it is even less common in the TV and Movies section of COCA.
The charts show the normalized frequency (per million words) in the BNC (left bar) and in the last three decades of the TV and Movies corpora (with the TV chart on the left and the Movies chart on the right). On these features, the TV and Movies data is also much more informal than COCA Spoken, which is not shown in the pictures below but which can be seen from the COCA link.