The iWeb corpus contains about 14 billion words. This is about 25 times the size of COCA (the Corpus of Contemporary American English; 560 million words), and about 140 times the size of the BNC (British National Corpus; 100 million words). This difference in size means that iWeb is usually much better for lower-frequency words and grammatical constructions, and especially for semantic analysis via collocates. The following are just a few examples of how a 14-billion-word corpus provides much richer and more nuanced data.

 

WORD FREQUENCY. For high-frequency words (e.g. words in the top 10,000 words of the language), a "small" 560 million word corpus may provide enough tokens for most analyses. For example, the following table shows a few words from around rank 8,000 in the iWeb frequency list. On average, these randomly selected words are about 32x as frequent in iWeb as in COCA (which makes sense for a corpus that is about 25x the size). But notice that the average frequency for these words in COCA is about 4,000 tokens, which is probably enough for good analysis.

iWeb ID  COCA ID  Word           PoS  More iWeb  Ratio PerMil  Freq iWeb  Freq COCA
8017     12811    unforgettable  j    59.2       2.50          82252      1388
8018     5443     subway         n    13.8       0.57          82239      5948
8019     7749     harmless       j    25.9       1.18          82231      3175
8020     8647     complementary  j    29.2       1.22          82205      2812
8021     21165    mini           n    136.8      5.72          82196      600
8022     20420    plug-in        j    102.8      4.31          82128      798
8023     15564    video          v    72.6       3.04          82112      1130
8024     5724     aftermath      n    13.6       0.57          82086      6023
8025     6469     racist         j    15.9       0.75          82057      5149
8026     1657     Iraqi          j    2.8        0.12          82036      28889
8027     7736     sport          v    23.2       0.97          82005      3533
8028     9621     solving        n    28.5       1.19          81970      2876
8029     4649     cling          v    10.2       0.43          81964      8004
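The relationship between the frequency columns can be checked with a little arithmetic. The following is a minimal sketch, assuming that "More iWeb" is the raw frequency in iWeb divided by the raw frequency in COCA, and that "Ratio PerMil" is the same comparison after normalizing each frequency per million words. The function names are illustrative, and the corpus sizes are the rounded figures given above, so the per-million ratios come out close to, but not identical to, the values in the table.

```python
# Corpus sizes in words, rounded as in the text (assumption: the table
# itself was probably computed from more exact sizes).
IWEB_SIZE = 14_000_000_000
COCA_SIZE = 560_000_000


def per_million(freq, corpus_size):
    """Normalize a raw token count to occurrences per million words."""
    return freq / corpus_size * 1_000_000


def compare(freq_iweb, freq_coca):
    """Return (raw frequency ratio, per-million ratio) for iWeb vs. COCA."""
    raw_ratio = freq_iweb / freq_coca
    pm_ratio = per_million(freq_iweb, IWEB_SIZE) / per_million(freq_coca, COCA_SIZE)
    return raw_ratio, pm_ratio


# (Freq iWeb, Freq COCA) pairs copied from three rows of the table.
rows = {
    "unforgettable": (82252, 1388),
    "subway": (82239, 5948),
    "Iraqi": (82036, 28889),
}

for word, (freq_iweb, freq_coca) in rows.items():
    raw, pm = compare(freq_iweb, freq_coca)
    print(f"{word:15s} raw ratio {raw:6.1f}   per-million ratio {pm:5.2f}")
```

A per-million ratio near 1.0 means the word is about equally common in both corpora once size is accounted for; values well above 1.0 (e.g. mini, plug-in) mark words that are disproportionately frequent in web text.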

 

Let's now look at words around rank 28,000 in the iWeb frequency list. The average frequency in COCA is now about 360 tokens.
 
iWeb ID  COCA ID  Word               PoS  More iWeb  Ratio PerMil  Freq iWeb  Freq COCA
28180    24539    banishment         n    16.8       0.73          5524       327
28181    36190    crate              v    51.6       2.21          5523       106
28182    54333    buildable          j    134.7      5.61          5523       40
28183    23672    crust              v    14.2       0.56          5523       387
28184    16018    institutionalized  j    6.3        0.26          5522       871
28185    30998    butterscotch       n    26.7       1.12          5521       206
28186    47323    glycoprotein       n    36.3       1.51          5521       151
28187    23064    neo-nazi           j    12.3       0.51          5520       449
28188    21704    oversimplify       v    12.5       0.51          5520       442
28189    21289    battering          n    11.9       0.48          5520       461
28190    41455    exchange-traded    j    33.2       1.47          5519       165
28191    40653    shutoff            n    55.2       2.21          5518       99
28192    20155    diatribe           n    10.8       0.45          5517       508
28193    23721    infra              r    5.4        0.23          5516       1023
28194    19203    coffeehouse        n    8.7        0.36          5515       632
28195    20739    mob                v    10.5       0.44          5515       523
28196    25649    gridiron           n    17.6       0.72          5515       313
28197    32927    walking            j    37.5       1.26          5515       146
28199    23547    sloop              n    14.3       0.60          5514       384

 

COLLOCATES. 360 tokens might seem like a lot, but it may be too few once we start looking at collocates (to get a sense of the meaning and usage of the words). For example, the word alabaster has 96 noun collocates that occur at least 10 times in iWeb, but only 3 in COCA. Likewise, chicory has more than 100 noun collocates that occur at least 10 times in iWeb, but only 2 in COCA. So 300-400 tokens for a word is often too few to get real insight into its meaning and usage. (And the advantage of iWeb is even more pronounced for much lower-frequency words.)
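To see why a frequency threshold penalizes the smaller corpus so heavily, here is a minimal sketch of counting the collocates that occur at least 10 times. The function name and the sample data are invented for illustration only; they are not drawn from iWeb or COCA, and this is not how the corpus interface itself computes collocates.

```python
from collections import Counter


def frequent_collocates(collocate_tokens, min_freq=10):
    """Return (collocate, count) pairs with count >= min_freq, most frequent first."""
    counts = Counter(collocate_tokens)
    return [(word, n) for word, n in counts.most_common() if n >= min_freq]


# Toy, invented data: nouns found near a node word in a small and a large sample.
# The large sample has the same proportions for the first three nouns, scaled by 25.
small_sample = ["skin"] * 12 + ["jar"] * 4 + ["statue"] * 3
large_sample = ["skin"] * 300 + ["jar"] * 100 + ["statue"] * 75 + ["lamp"] * 11

print(len(frequent_collocates(small_sample)))  # 1
print(len(frequent_collocates(large_sample)))  # 4
```

In the small sample only the single most common collocate clears the threshold, while the large sample also surfaces the rarer ones, which are often the most informative about meaning and usage.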

 

SYNTAX. Here are just a couple of examples of how very large corpora can provide much richer data for low-frequency constructions. Of course, many others could be given as well.

 

1. In iWeb there are nearly 700 tokens of the string "HAVE been being _vvn" (e.g. had been being considered), compared to just 16 in COCA (and just 2 in the BNC). With more than 40 times as many tokens of this construction as COCA (and 350 times as many as the BNC), it is obviously much more feasible to do careful syntactic analyses in iWeb.
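For readers working with untagged text, the query above can be roughly approximated with a regular expression. This is only a sketch under stated assumptions: the corpus query relies on lemmas (HAVE) and part-of-speech tags (_vvn, a past participle), whereas the pattern below simply lists the forms of have and guesses at participles by their -ed or -en ending. The pattern and function name are illustrative, not part of any corpus interface.

```python
import re

# Crude stand-in for the tagged query: a form of HAVE, then "been being",
# then a word ending in -ed or -en. Irregular participles that end
# differently (e.g. "built", "told") are missed.
PATTERN = re.compile(
    r"\b(?:have|has|had|having)\s+been\s+being\s+(\w+(?:ed|en))\b",
    re.IGNORECASE,
)


def find_been_being(text):
    """Return the participle-like words that follow 'HAVE been being' in text."""
    return PATTERN.findall(text)


sample = (
    "The plan had been being considered for months, "
    "and the house has been being built since May."
)
print(find_been_being(sample))  # ['considered']
```

The second clause in the sample is a genuine instance of the construction, but "built" is missed, which shows why a tagged corpus query is more reliable than surface pattern matching for this kind of analysis.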

 

2. In iWeb there are nearly 700 different main verbs that occur at least two times in the construction "VERB NOUN into VERB-ing" (e.g. talk him into going, trick her into believing, coerce them into staying). Compare this to 146 verbs in COCA and 51 verbs in the BNC. As the references in these links indicate, the more main verbs we have, the greater precision we will have in correctly describing this construction.