English-Corpora.org


SIZE

Corpus size matters a great deal for the richness of the corpus data. A tiny one million word corpus is extremely limited in the phenomena that can be studied with it, compared to a one billion word corpus, which has 1,000 times as much data.

The following are just a handful of examples that show the importance of size, from just two different types of searches: low-level grammatical constructions and collocates. In each case, we show the number of tokens in the BNC (100 million words), COCA (1 billion words), and iWeb (14 billion words). Now imagine a tiny 1 million word corpus: there would be virtually no tokens of any of these phenomena. (More information on the importance of size for historical corpora and data)



GRAMMATICAL CONSTRUCTIONS

Construction | Example | BNC | COCA | iWeb
HAVE been being _v?n | had been being considered | 2 | 41 | 648
VERB + me into _v?g (grouped by lemma; >=3) | coerce + me into going | 4 | 51 | 157
LOVE for PRON to VERB | (I'd) love for him to help (us) | 2 | 221 | 8,848
VERB _ap way PREP ART NOUN (# strings >= 3) | pushed his way through the crowd | 16 | 292 | 17,820
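
To make the "virtually no tokens" point concrete, the counts in the table above can be converted to a rate per million words. The following is only a rough sketch: the function names are made up for illustration, the corpus sizes are the rounded figures given in the text, and two of the rows are counts under a frequency threshold rather than raw token counts, so the rates are approximate.

```python
# Corpus sizes in millions of words, using the rounded figures from the text.
CORPUS_SIZE_MILLIONS = {"BNC": 100, "COCA": 1_000, "iWeb": 14_000}

# Counts copied from the grammatical constructions table.
COUNTS = {
    "HAVE been being _v?n": {"BNC": 2, "COCA": 41, "iWeb": 648},
    "VERB + me into _v?g": {"BNC": 4, "COCA": 51, "iWeb": 157},
    "LOVE for PRON to VERB": {"BNC": 2, "COCA": 221, "iWeb": 8848},
    "VERB _ap way PREP ART NOUN": {"BNC": 16, "COCA": 292, "iWeb": 17820},
}


def per_million(count, corpus):
    """Return the frequency per million words for a count in the given corpus."""
    return count / CORPUS_SIZE_MILLIONS[corpus]


def expected_in_one_million(counts):
    """Return the highest per-million rate across the corpora.

    This is the most optimistic estimate of how many hits a
    1 million word corpus would contain (hypothetical helper).
    """
    return max(per_million(count, corpus) for corpus, count in counts.items())


if __name__ == "__main__":
    for construction, counts in COUNTS.items():
        rates = {corpus: round(per_million(c, corpus), 3) for corpus, c in counts.items()}
        best = expected_in_one_million(counts)
        print(f"{construction}: {rates} -> at most about {best:.2f} per million words")
```

Even on the most optimistic rate, three of the four constructions would be expected to occur less than once in a 1 million word corpus, and the fourth only about once.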


COLLOCATES
(see Excel file with hundreds of examples)

Corpus size is crucial for finding collocates (nearby words, which provide valuable insight into meaning and usage). The following chart shows the number of collocates for a small sampling of "node" words. Here a collocate is a lemma that occurs five times or more within a span of four words to the left (4L) through four words to the right (4R) of the node word. As you can see, many words have a good frequency as a node word in the BNC (e.g. 166 tokens of browse) but simply do not have many collocates there. A corpus like iWeb (at nearly 14 billion words) provides much more insight than a (now) "small-ish" corpus like the BNC.

Node (PoS) + collocate (PoS) | Example | BNC (node) | COCA (node) | BNC (coll) | COCA (coll) | iWeb
Verb + Noun | browse | 166 | 5,718 | 2 | 398 | 4,000+
Noun + Adjective | stewardship | 169 | 2,515 | 0 | 63 | 715
Adjective + Noun | outlandish | 97 | 1,649 | 0 | 72 | 978
Adverb + Verb | rightfully | 69 | 2,264 | 1 | 88 | 615
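
The collocate criterion used above (five or more occurrences within 4 words to the left and 4 words to the right of the node) can be sketched in a few lines of Python. This is a minimal illustration, not the method the site itself uses: the function name `collocates` is hypothetical, the input is assumed to be an already lemmatized list of tokens, and no part-of-speech filtering is applied.

```python
from collections import Counter


def collocates(tokens, node, span=4, min_freq=5):
    """Return {word: count} for words near `node` that meet the threshold.

    Assumes `tokens` is a list of lemmas (an assumption of this sketch).
    Every token within `span` positions to the left or right of each
    occurrence of `node` is counted; words seen fewer than `min_freq`
    times are dropped.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        counts.update(tokens[max(0, i - span):i])   # left context (up to 4L)
        counts.update(tokens[i + 1:i + 1 + span])   # right context (up to 4R)
    return {word: n for word, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    # Toy data: "web" is a frequent neighbor of "browse", "catalog" is not.
    toy = ["browse", "web"] * 5 + ["browse", "catalog"]
    print(collocates(toy, "browse", span=1))
```

With a threshold like this, a node word that occurs only 100 to 200 times has few chances for any single neighbor to reach five co-occurrences, which is consistent with the very low BNC collocate counts in the chart.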