English-Corpora.org

SIZE

Corpus size is extremely important for the richness of the corpus data. A tiny one million word corpus is very limited in the phenomena that can be studied with it, compared to a one billion word corpus, which has 1,000 times as much data.

The following are a handful of examples that show the importance of size, taken from just two types of searches: low-level grammatical constructions and collocates. In each case, we show the number of tokens in the BNC (100 million words), COCA (1 billion words), and iWeb (14 billion words). Imagine also that we had a tiny 1 million word corpus: there would be virtually no tokens of any of these phenomena. (More information on the importance of size for historical corpora and data)


GRAMMATICAL CONSTRUCTIONS

Construction	Example	BNC	COCA	iWeb
HAVE been being _v?n	had been being considered	2	41	648
VERB + me into _v?g (grouped by lemma; >=3)	coerce + me into going	4	51	157
LOVE for PRON to VERB	(I'd) love for him to help (us)	2	221	8848
VERB _ap way PREP ART NOUN (# strings >= 3)	pushed his way through the crowd	16	292	17820
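
As a rough check on the claim that a 1 million word corpus would have virtually no tokens of these constructions, the counts above can be normalized to a rate per million words. The following is only a minimal sketch: it uses the two rows that carry no frequency threshold, it takes the corpus sizes as the round figures given in the text, and the names `CORPUS_SIZES`, `COUNTS`, `per_million`, and `rates_per_million` are our own, not part of any corpus interface.

```python
# Corpus sizes in words, using the round figures stated in the text (an approximation).
CORPUS_SIZES = {
    "BNC": 100_000_000,
    "COCA": 1_000_000_000,
    "iWeb": 14_000_000_000,
}

# Token counts copied from the table above (rows without a ">= 3" threshold).
COUNTS = {
    "HAVE been being _v?n": {"BNC": 2, "COCA": 41, "iWeb": 648},
    "LOVE for PRON to VERB": {"BNC": 2, "COCA": 221, "iWeb": 8848},
}


def per_million(tokens, corpus_size):
    """Return the number of tokens per one million words of corpus text."""
    return tokens / corpus_size * 1_000_000


def rates_per_million(counts=COUNTS, sizes=CORPUS_SIZES):
    """Map each construction to its per-million rate in each corpus."""
    return {
        construction: {
            corpus: per_million(n, sizes[corpus]) for corpus, n in by_corpus.items()
        }
        for construction, by_corpus in counts.items()
    }


if __name__ == "__main__":
    for construction, by_corpus in rates_per_million().items():
        for corpus, rate in by_corpus.items():
            print(f"{construction:25} {corpus:5} {rate:.3f} per million words")
```

Under these assumptions, every rate comes out below one token per million words, which is what we would expect to find, on average, in a corpus of that size.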

COLLOCATES (see Excel file with hundreds of examples)

Corpus size is crucial for finding collocates (nearby words, which provide valuable insight into meaning and usage). The following table shows the number of collocates for a small sampling of "node" words, where a collocate is a lemma that occurs five times or more within a span of four words to the left (4L) through four words to the right (4R) of the node. As you can see, many words have a reasonable frequency as a node word in the BNC (e.g. 166 tokens with browse) but still have very few collocates. A corpus like iWeb (at nearly 14 billion words) provides much more insight than a (now) "small-ish" corpus like the BNC.

Node (PoS) + collocate (PoS)	Example	BNC (node)	COCA (node)	BNC (coll)	COCA (coll)	iWeb (coll)
Verb + Noun	browse	166	5718	2	398	4000+
Noun + Adjective	stewardship	169	2515	0	63	715
Adjective + Noun	outlandish	97	1649	0	72	978
Adverb + Verb	rightfully	69	2264	1	88	615
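
The collocate criterion described above (lemmas occurring five times or more within 4L to 4R of the node) can be sketched as follows. This is a simplified illustration, not the code behind the corpus interface: it assumes the text has already been lemmatized into a flat list, it ignores part of speech and sentence boundaries, and the function name `collocates` and its parameters are hypothetical.

```python
from collections import Counter


def collocates(lemmas, node, left=4, right=4, min_freq=5):
    """Count lemmas near each occurrence of `node`.

    A lemma is counted once for every time it appears within `left`
    positions before or `right` positions after an occurrence of `node`.
    Only lemmas with a total count of at least `min_freq` are returned.
    """
    counts = Counter()
    for i, lemma in enumerate(lemmas):
        if lemma != node:
            continue
        start = max(0, i - left)                # clip the window at the start of the text
        end = min(len(lemmas), i + right + 1)   # clip the window at the end of the text
        for j in range(start, end):
            if j != i:                          # skip the node occurrence itself
                counts[lemmas[j]] += 1
    return {word: n for word, n in counts.items() if n >= min_freq}
```

With a fixed threshold like `min_freq=5`, a node word with only a few hundred tokens will rarely have any single neighbor repeat often enough to qualify, which is consistent with the low BNC collocate counts in the table.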