The iWeb corpus contains about 14 billion words. This is about 25 times the size of COCA (the Corpus of Contemporary American English; 560 million words), and about 140 times the size of the BNC (British National Corpus; 100 million words). This difference in size means that iWeb is usually much better for lower-frequency words and grammatical constructions, and especially for semantic analysis via collocates. The following are just a few examples of how a 14-billion-word corpus provides much richer and more nuanced data.

WORD FREQUENCY. For high-frequency words (e.g. words in the top 10,000 words of the language), a "small" 560-million-word corpus may provide enough tokens for most analyses. For example, the following table shows a few words ranked at about 8,000th in iWeb. On average, these randomly selected words are about 32x as frequent in iWeb as in COCA (which makes sense for a corpus that is about 25x the size). But notice that the average frequency for these words in COCA is about 4,000 tokens, which is probably enough for good analysis.

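| iWeb ID | COCA ID | Word | PoS | More iWeb (raw freq ratio) | Ratio PerMil | Freq iWeb | Freq COCA |
|---|---|---|---|---|---|---|---|
| 8017 | 12811 | unforgettable | j | 59.2 | 2.50 | 82252 | 1388 |
| 8018 | 5443 | subway | n | 13.8 | 0.57 | 82239 | 5948 |
| 8019 | 7749 | harmless | j | 25.9 | 1.18 | 82231 | 3175 |
| 8020 | 8647 | complementary | j | 29.2 | 1.22 | 82205 | 2812 |
| 8021 | 21165 | mini | n | 136.8 | 5.72 | 82196 | 600 |
| 8022 | 20420 | plug-in | j | 102.8 | 4.31 | 82128 | 798 |
| 8023 | 15564 | video | v | 72.6 | 3.04 | 82112 | 1130 |
| 8024 | 5724 | aftermath | n | 13.6 | 0.57 | 82086 | 6023 |
| 8025 | 6469 | racist | j | 15.9 | 0.75 | 82057 | 5149 |
| 8026 | 1657 | Iraqi | j | 2.8 | 0.12 | 82036 | 28889 |
| 8027 | 7736 | sport | v | 23.2 | 0.97 | 82005 | 3533 |
| 8028 | 9621 | solving | n | 28.5 | 1.19 | 81970 | 2876 |
| 8029 | 4649 | cling | v | 10.2 | 0.43 | 81964 | 8004 |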
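The two ratio columns in the table above can be approximated from the raw frequencies. The following is only a minimal sketch: it assumes the rounded corpus sizes given in the text (14 billion and 560 million words), and the function names are hypothetical. Because the true corpus sizes differ slightly from these rounded figures, the results will be close to, but not identical with, the values in the table.

```python
from __future__ import annotations

# Corpus sizes in tokens, rounded as in the text (assumed values; the
# exact sizes differ slightly, so results only approximate the table).
IWEB_SIZE = 14_000_000_000
COCA_SIZE = 560_000_000


def per_million(freq: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million tokens."""
    return freq / corpus_size * 1_000_000


def compare(freq_iweb: int, freq_coca: int) -> tuple[float, float]:
    """Return (raw ratio, per-million ratio) of iWeb to COCA frequency."""
    raw_ratio = freq_iweb / freq_coca
    norm_ratio = per_million(freq_iweb, IWEB_SIZE) / per_million(freq_coca, COCA_SIZE)
    return raw_ratio, norm_ratio


# Frequencies for "subway" and "mini" are taken from the table.
for word, f_iweb, f_coca in [("subway", 82239, 5948), ("mini", 82196, 600)]:
    raw, norm = compare(f_iweb, f_coca)
    print(f"{word}: raw ratio {raw:.1f}, per-million ratio {norm:.2f}")
```

A per-million ratio near 1.0 means the word is about equally common in both corpora once size is taken into account; a raw ratio near 25 means the same thing in terms of token counts.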

Let's now look at words ranked at about 28,000th in iWeb. The average frequency in COCA is now only about 360 tokens.

| iWeb ID | COCA ID | Word | PoS | More iWeb (raw freq ratio) | Ratio PerMil | Freq iWeb | Freq COCA |
|---|---|---|---|---|---|---|---|
| 28180 | 24539 | banishment | n | 16.8 | 0.73 | 5524 | 327 |
| 28181 | 36190 | crate | v | 51.6 | 2.21 | 5523 | 106 |
| 28182 | 54333 | buildable | j | 134.7 | 5.61 | 5523 | 40 |
| 28183 | 23672 | crust | v | 14.2 | 0.56 | 5523 | 387 |
| 28184 | 16018 | institutionalized | j | 6.3 | 0.26 | 5522 | 871 |
| 28185 | 30998 | butterscotch | n | 26.7 | 1.12 | 5521 | 206 |
| 28186 | 47323 | glycoprotein | n | 36.3 | 1.51 | 5521 | 151 |
| 28187 | 23064 | neo-nazi | j | 12.3 | 0.51 | 5520 | 449 |
| 28188 | 21704 | oversimplify | v | 12.5 | 0.51 | 5520 | 442 |
| 28189 | 21289 | battering | n | 11.9 | 0.48 | 5520 | 461 |
| 28190 | 41455 | exchange-traded | j | 33.2 | 1.47 | 5519 | 165 |
| 28191 | 40653 | shutoff | n | 55.2 | 2.21 | 5518 | 99 |
| 28192 | 20155 | diatribe | n | 10.8 | 0.45 | 5517 | 508 |
| 28193 | 23721 | infra | r | 5.4 | 0.23 | 5516 | 1023 |
| 28194 | 19203 | coffeehouse | n | 8.7 | 0.36 | 5515 | 632 |
| 28195 | 20739 | mob | v | 10.5 | 0.44 | 5515 | 523 |
| 28196 | 25649 | gridiron | n | 17.6 | 0.72 | 5515 | 313 |
| 28197 | 32927 | walking | j | 37.5 | 1.26 | 5515 | 146 |
| 28199 | 23547 | sloop | n | 14.3 | 0.60 | 5514 | 384 |

COLLOCATES. 360 tokens might seem like a lot, but it may be too few once we start looking at collocates (to get a sense of the meaning and usage of a word). For example, the word alabaster has 96 noun collocates that occur at least 10 times in iWeb, but only 3 in COCA. Likewise, chicory has more than 100 noun collocates that occur at least 10 times in iWeb, but only 2 in COCA. So 300-400 tokens for a word is often too few to really get insight into its meaning and usage. (And the advantage of iWeb is even more pronounced for much lower-frequency words.)
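To make the "collocates that occur at least 10 times" threshold concrete, here is a minimal sketch of how such a count could be computed from a tagged corpus. It is not how the corpus interface itself works: the function name, the (lemma, tag) token format, the window size, and the convention that noun tags start with "n" are all assumptions made for illustration, and the toy data is invented.

```python
from collections import Counter


def noun_collocates(tagged_tokens, node, window=4, min_freq=10):
    """Count noun collocates of `node` within +/- `window` tokens.

    `tagged_tokens` is a list of (lemma, pos) pairs; nouns are assumed
    to carry a tag starting with "n" (a hypothetical tagging scheme).
    Only collocates seen at least `min_freq` times are returned.
    """
    counts = Counter()
    for i, (lemma, _pos) in enumerate(tagged_tokens):
        if lemma != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tagged_tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # skip the node word itself
            coll_lemma, coll_pos = tagged_tokens[j]
            if coll_pos.startswith("n"):
                counts[coll_lemma] += 1
    return {w: c for w, c in counts.items() if c >= min_freq}


# Invented toy corpus: "alabaster skin ." 12 times, "alabaster vase ." 3 times.
toy = ([("alabaster", "j"), ("skin", "n"), (".", "y")] * 12
       + [("alabaster", "j"), ("vase", "n"), (".", "y")] * 3)

print(noun_collocates(toy, "alabaster", window=1, min_freq=10))  # {'skin': 12}
```

In the toy data, vase co-occurs with alabaster only 3 times and so falls below the threshold. This is the situation described above for a smaller corpus: genuine collocates exist, but too few of them reach the minimum frequency to be visible.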

SYNTAX. Here are just a couple of examples of how very large corpora can provide much richer data for low-frequency constructions. Of course, many others could be given as well.

1. In iWeb there are nearly 700 tokens of the string "HAVE been being _vvn" (e.g. had been being considered), compared to just 16 in COCA (and just 2 in the BNC). With more than 40 times as many tokens as COCA (and about 350 times as many as the BNC), it is obviously much more feasible to do careful syntactic analyses in iWeb.

2. In iWeb there are nearly 700 different main verbs that occur at least twice in the construction "VERB NOUN into VERB-ing" (e.g. talk him into going, trick her into believing, coerce them into staying). Compare this to 146 verbs in COCA and 51 verbs in the BNC. As the references in these links indicate, the more main verbs we have, the more precisely we can describe this construction.
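The two searches above can be sketched as simple pattern matches over a tagged token stream. This is only an illustration under stated assumptions, not the corpus's own query engine: the function names are hypothetical, each token is assumed to be a (word, lemma, tag) triple, the tags imitate CLAWS-style labels ("vvn" for a past participle, "vvg" for an -ing form), and the NOUN slot is assumed to accept pronouns as well, as in the examples given (him, her, them).

```python
from collections import Counter

# Each token is a (word, lemma, tag) triple. The tag names imitate
# CLAWS-style tags, but the exact scheme is an assumption of this sketch.


def count_have_been_being(tokens):
    """Count 4-token sequences: HAVE + been + being + past participle."""
    hits = 0
    for i in range(len(tokens) - 3):
        t0, t1, t2, t3 = tokens[i:i + 4]
        if (t0[1] == "have" and t1[0].lower() == "been"
                and t2[0].lower() == "being" and t3[2] == "vvn"):
            hits += 1
    return hits


def into_ing_main_verbs(tokens, min_freq=2):
    """Main-verb lemmas seen >= min_freq times in VERB NOUN into VERB-ing."""
    counts = Counter()
    for i in range(len(tokens) - 3):
        verb, obj, prep, ing = tokens[i:i + 4]
        if (verb[2].startswith("v")
                and obj[2].startswith(("n", "p"))  # noun or pronoun object
                and prep[0].lower() == "into"
                and ing[2] == "vvg"):
            counts[verb[1]] += 1
    return {lemma for lemma, c in counts.items() if c >= min_freq}


# Invented sample built from the example phrases in the text.
sample = [
    ("had", "have", "vhd"), ("been", "be", "vbn"),
    ("being", "be", "vbg"), ("considered", "consider", "vvn"),
    ("talked", "talk", "vvd"), ("him", "he", "ppho1"),
    ("into", "into", "ii"), ("going", "go", "vvg"),
    ("talk", "talk", "vv0"), ("her", "she", "ppho1"),
    ("into", "into", "ii"), ("staying", "stay", "vvg"),
    ("tricked", "trick", "vvd"), ("them", "they", "ppho2"),
    ("into", "into", "ii"), ("believing", "believe", "vvg"),
]

print(count_have_been_being(sample))  # 1
print(into_ing_main_verbs(sample))    # {'talk'}
```

In the sample, trick occurs only once in the construction and is therefore excluded by the "at least two times" threshold. In a small corpus many verbs drop out in exactly this way, which is why a larger corpus yields a much longer list of main verbs.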