English Corpora: most widely used online corpora. Billions of words of data: free online access

In the British National Corpus 1994, the Spoken texts account for 10% of the corpus (or about 10 million words). In COCA, we wanted Spoken to be more much larger than that, and so about 127 million words (or about 12.5% of the corpus) comes from spoken American English.

It would have been impossible, however, to create a corpus that size by tape recording lectures, conversations, etc – especially since the corpus was created by one person, in about 3-4 months, with a budget of approximately $0. The only option was to use transcripts of conversations, which were already in electronic form. Therefore, we obtained transcripts of unscripted conversation on TV and radio programs like All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer (syndicated), etc.

There are several questions, of course, regarding the use of transcripts like these. Perhaps the three most important ones are:

     1) Do they faithfully represent the actual conversations?
     2) Is the conversation really unscripted?
     3) How well does it represent "non-media" varieties of Spoken American English?

Regarding the first question, we feel confident that the transcripts do represent very well the actual spoken conversation. Watch this video from the Larry King show on CNN (the Tiger Woods segment) and then compare it to the transcript. Another example is from the "Talk of the Nation" show on NPR. Compare the audio recording (click on "Listen Now") with the transcript (as contained in our corpus). Our sense is that the transcripts do an excellent job transcribing the conversation, including interruptions, false starts, and so on.

The second question is whether the conversation is really unscripted. In the Larry King interview above, there are a handful of "formulaic / scripted" sentences like "Welcome to the program", "We'll now go to a commercial break", etc. But probably 97-98% or more of the conversation is unscripted. In the NPR transcript, there is a bit more scripted material -- a paragraph or two at the beginning of the show, and some announcements for upcoming commercial breaks. But about 95% or so is still unscripted. The question is whether you would rather have a 127 million word spoken corpus with about 5% scripted material (but still leaving more than 120 million words of unscripted material), or a "completely pure" corpus that is so small (1-2 million words) that it is unusable for many types of research. We opted for the former.

In terms of the third issue (naturalness), there is one aspect of these texts that does make them somewhat unlike completely natural conversation. That is of course the fact that the people knew that they were on a national TV or radio program, and they therefore probably altered their speech accordingly -- such as relatively little profanity and perhaps avoiding highly stigmatized words and phrases like ain't got none. In terms of overall word choice and "natural conversation" (false starts, interruptions, and so on), though, it does seem to represent "off the air" conversation quite nicely. But no spoken corpus (even those created by linguists with tape recorders in the early 1990s) will be 100% authentic for real conversation -- as long as people know that they're being recorded.

Finally, it is possible to do some quick searches that show the overwhelming "spoken" nature of these texts. The following phrases are ones that we would expect to occur much more frequently in spoken (American) English than in other genres like Fiction, Newspapers, or Academic (but in most cases, not as much as TV/Movies, which are even more informal). Click on them to see how common they are in spoken and the other genres:

Note: click on any link on this page to see the corpus data, and then click on the "BACK" image (see left) at the top of the page to come back to this page.

One other note:

In the transcripts there is text indicating who the speaker is, or codes referring to "voice-overs" or other notations made by the transcriber. For example:

SUMMIT It should be a C-note. Mr. CARY ANDERSON That's it. Mr. ANDERSON Oh, very good. See, you didn't have to get nervous, Mr. Cronick. You were really very good at it. SUMMIT All right. Mr. ANDERSON You -- you were coming fast and furious here. It was great. I could sleep. ( Laughing ) I appreciate it.

We were able to "separate out" most of these (shown in gray above). Where they have been separated out, they are not included in the overall word count, nor can they be searched for (this is on purpose, since they're not "spoken"), but they do appear in the Keyword in Context displays.

English-Corpora.org