WORD FREQUENCY. For high-frequency words (e.g. words in the top
10,000 words of the language), a "small" 560 million word corpus such
as COCA may provide enough tokens for most analyses. For example, the
following table shows a few words from around the 8,000th most frequent
word in iWeb. On average, these randomly selected words are about 32x as
frequent in iWeb as in COCA (which makes sense for a corpus that is
about 25x the size). But notice that the average frequency for these
words in COCA is about 4,000 tokens, which is probably enough for good
analysis.
Let's now look at words down around the 28,000th most frequent word in
the corpus. The average frequency in COCA is now about 360 tokens.
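The size comparison above can be made concrete with a little arithmetic. The following is only a rough sketch: the COCA size and the two average token counts come from the text, the iWeb size is derived from the "about 25x" figure, and the function names are made up for illustration.

```python
# Corpus sizes: COCA's size is stated in the text; iWeb's is an assumption
# derived from the statement that iWeb is "about 25x the size".
COCA_SIZE = 560_000_000
IWEB_SIZE = 25 * COCA_SIZE


def per_million(tokens, corpus_size):
    """Normalized frequency: occurrences per million words of corpus."""
    return tokens / corpus_size * 1_000_000


def expected_tokens(tokens, from_size, to_size):
    """Tokens we would expect in another corpus if frequency scaled with size."""
    return tokens * to_size / from_size


# Average COCA token counts quoted in the text for the two frequency bands.
for label, coca_tokens in [("~8,000th word", 4_000), ("~28,000th word", 360)]:
    rate = per_million(coca_tokens, COCA_SIZE)
    scaled = expected_tokens(coca_tokens, COCA_SIZE, IWEB_SIZE)
    print(f"{label}: {coca_tokens} tokens in COCA, "
          f"{rate:.2f} per million, ~{scaled:,.0f} expected in iWeb")
```

Scaling by size alone is a simplification. The text reports that the sampled words are about 32x as frequent in iWeb, not 25x, so actual counts will differ from these estimates.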
COLLOCATES. An average of 360 tokens might seem like a lot, but it may be too few once we start looking at collocates (to get a sense of the meaning and usage of the words). For example, the word alabaster has 96 noun collocates that occur at least 10 times in iWeb, but only 3 in COCA. Likewise, chicory has more than 100 noun collocates that occur at least 10 times in iWeb, but only 2 in COCA. So 300-400 tokens for a word is often too few to give real insight into its meaning and usage. (And the advantage of iWeb is even more pronounced for much lower frequency words.)
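To illustrate why a minimum-frequency threshold thins out the collocate list so quickly in a smaller corpus, here is a minimal sketch of counting noun collocates. It is not how the corpus interface itself computes collocates: the function name, the 4-word window, the "NN" tag prefix and the toy sentences are all assumptions made for the example.

```python
from collections import Counter


def noun_collocates(sentences, node, window=4, min_freq=10, noun_prefix="NN"):
    """Count nouns within `window` words of `node`, keeping those seen
    at least `min_freq` times. `sentences` is a list of sentences, each a
    list of (word, tag) pairs; the window never crosses a sentence boundary.
    """
    counts = Counter()
    for sent in sentences:
        for i, (word, _tag) in enumerate(sent):
            if word.lower() != node:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue  # skip the node word itself
                other, tag = sent[j]
                if tag.startswith(noun_prefix):
                    counts[other.lower()] += 1
    return {w: c for w, c in counts.items() if c >= min_freq}


# Invented toy data with hypothetical tags, not real corpus output.
vase = [("the", "AT"), ("alabaster", "JJ"), ("vase", "NN1")]
lamp = [("an", "AT"), ("alabaster", "JJ"), ("lamp", "NN1")]
toy_corpus = [vase] * 12 + [lamp] * 3

print(noun_collocates(toy_corpus, "alabaster", min_freq=10))
print(noun_collocates(toy_corpus, "alabaster", min_freq=2))
```

With the threshold of 10, only the collocate seen 12 times survives; the one seen 3 times appears only when the threshold is lowered. The same effect, at scale, is why a word with a few hundred tokens yields so few collocates above the cutoff.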
SYNTAX. Here are just two examples of how very large corpora can provide much richer data for low-frequency constructions. Many others could be given as well.
1. In iWeb there are nearly 700 tokens of the string "HAVE been being _vvn" (e.g. had been being considered), compared to just 16 in COCA (and just 2 in the BNC). With more than 40 times as many tokens of this string as COCA (and 350 times as many as the BNC), it's obviously much more feasible to do careful syntactic analyses in iWeb.
2. In iWeb there are nearly 700 different main verbs that occur at least twice in the construction "VERB NOUN into VERB-ing" (e.g. talk him into going, trick her into believing, coerce them into staying). Compare this to 146 verbs in COCA and 51 verbs in the BNC. As the references in these links indicate, the more main verbs we have, the more precisely we can describe this construction.
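The two searches above can be sketched over a small tagged token list. This is only an illustration under stated assumptions, not the corpus interface's own query engine: the (word, lemma, tag) format, the tag names, the choice to let pronouns fill the NOUN slot (as in the examples), and both function names are hypothetical.

```python
from collections import Counter


def count_have_been_being(tokens):
    """Count 4-token sequences: any form of HAVE + 'been' + 'being' + past participle."""
    hits = 0
    for i in range(len(tokens) - 3):
        (_, lemma0, _), (w1, _, _), (w2, _, _), (_, _, t3) = tokens[i:i + 4]
        if (lemma0 == "have" and w1.lower() == "been"
                and w2.lower() == "being" and t3 == "VVN"):
            hits += 1
    return hits


def into_ing_verbs(tokens, min_freq=2):
    """Main-verb lemmas seen at least `min_freq` times in
    'VERB NOUN into VERB-ing' (noun or pronoun in the second slot)."""
    counts = Counter()
    for i in range(len(tokens) - 3):
        (_, lemma0, t0), (_, _, t1), (w2, _, _), (_, _, t3) = tokens[i:i + 4]
        if (t0.startswith("VV") and t1.startswith(("N", "PP"))
                and w2.lower() == "into" and t3 == "VVG"):
            counts[lemma0] += 1
    return sorted(v for v, c in counts.items() if c >= min_freq)


# Invented toy data with hypothetical tags, not real corpus output.
toy = [
    ("She", "she", "PPHS1"), ("had", "have", "VHD"), ("been", "be", "VBN"),
    ("being", "be", "VBG"), ("considered", "consider", "VVN"), (".", ".", "."),
    ("They", "they", "PPHS2"), ("talked", "talk", "VVD"), ("him", "he", "PPHO1"),
    ("into", "into", "II"), ("going", "go", "VVG"), (".", ".", "."),
    ("He", "he", "PPHS1"), ("talks", "talk", "VVZ"), ("her", "she", "PPHO1"),
    ("into", "into", "II"), ("staying", "stay", "VVG"), (".", ".", "."),
    ("We", "we", "PPIS2"), ("tricked", "trick", "VVD"), ("them", "they", "PPHO2"),
    ("into", "into", "II"), ("believing", "believe", "VVG"), (".", ".", "."),
]

print(count_have_been_being(toy))
print(into_ing_verbs(toy, min_freq=2))
print(into_ing_verbs(toy, min_freq=1))
```

Counting by lemma rather than word form means talked and talks are grouped under one main verb, which is what a count of "different main verbs" requires.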