SIZE
Corpus size is critically important for the richness of the corpus data. A small
one-million-word corpus is extremely limited in the phenomena that can be studied
with it, compared to a one-billion-word corpus, which contains 1,000 times as
much data.
The following are just a handful of examples that show the importance of size,
drawn from just two types of searches: low-level grammatical constructions and
collocates. In each case, we show the number of tokens in the BNC (100 million
words), COCA (1 billion words), and iWeb (14 billion words). Imagine also that we
had a tiny 1 million word corpus: there would be virtually no tokens of any of
these phenomena. (More information on the importance of size for historical
corpora and data)
GRAMMATICAL CONSTRUCTIONS
Construction | Example | BNC | COCA | iWeb |
HAVE been being _v?n | had been being considered | 2 | 41 | 648 |
VERB + me into _v?g (grouped by lemma; >=3) | coerce + me into going | 4 | 51 | 157 |
LOVE for PRON to VERB | (I'd) love for him to help (us) | 2 | 221 | 8,848 |
VERB _ap way PREP ART NOUN (# strings >= 3) | pushed his way through the crowd | 16 | 292 | 17,820 |
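To make the "virtually no tokens" point concrete, the counts in the table above can be scaled down to a hypothetical 1 million word corpus. The following is only a rough sketch: it assumes that frequency scales linearly with corpus size, which real corpora (with different genres and time periods) will not follow exactly, and the function name `expected_tokens` is invented for this illustration.

```python
# Corpus sizes in words, as given in the text above.
CORPUS_SIZES = {
    "BNC": 100_000_000,
    "COCA": 1_000_000_000,
    "iWeb": 14_000_000_000,
}

# Raw token counts copied from two rows of the table above.
COUNTS = {
    "HAVE been being _v?n": {"BNC": 2, "COCA": 41, "iWeb": 648},
    "LOVE for PRON to VERB": {"BNC": 2, "COCA": 221, "iWeb": 8848},
}


def expected_tokens(count, corpus_words, target_words=1_000_000):
    """Scale a raw count to a corpus of target_words words.

    Assumes (simplistically) that frequency is proportional to corpus size.
    """
    return count * target_words / corpus_words


if __name__ == "__main__":
    for construction, by_corpus in COUNTS.items():
        for corpus, count in by_corpus.items():
            estimate = expected_tokens(count, CORPUS_SIZES[corpus])
            print(f"{construction} ({corpus}): {estimate:.3f} per million words")
```

Under this assumption, every estimate for these two constructions comes out below one token per million words, which is why a 1 million word corpus would be expected to show little or nothing.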
COLLOCATES (see Excel file with hundreds of examples)
Corpus size is crucial for finding collocates (nearby words, which provide
valuable insight into meaning and usage). The following chart shows the number of
collocates for a small sampling of "node" words. As you can see, many words have
a reasonable frequency as a node word in the BNC (e.g. 166 tokens with browse)
but simply do not have many collocates (lemmas that occur five times or more in a
span of 4L to 4R, i.e. within four words to the left or right of the node word).
Because the richness of the collocates depends so heavily on size, a corpus like
iWeb (at nearly 14 billion words) provides much more insight than a (now)
"small-ish" corpus like the BNC.
|
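The definition of a collocate used above (a lemma occurring five times or more within a span of 4L to 4R of the node word) can be sketched in a few lines. This is a minimal illustration, not the method the corpus interface itself uses: it assumes the text has already been tokenized and lemmatized into a flat list, and the function name `count_collocates` and its parameters are hypothetical.

```python
from collections import Counter


def count_collocates(lemmas, node, left=4, right=4, min_freq=5):
    """Return {collocate: frequency} for lemmas near each occurrence of node.

    lemmas   -- list of lemmatized tokens (assumed to be prepared elsewhere)
    node     -- the node word whose collocates are wanted
    left     -- number of positions to the left to include (4L)
    right    -- number of positions to the right to include (4R)
    min_freq -- minimum co-occurrence count for a lemma to be kept
    """
    counts = Counter()
    for i, lemma in enumerate(lemmas):
        if lemma != node:
            continue
        start = max(0, i - left)
        # Everything in the window except the node occurrence itself.
        window = lemmas[start:i] + lemmas[i + 1:i + 1 + right]
        counts.update(window)
    return {word: n for word, n in counts.items() if n >= min_freq}
```

With only a few hundred tokens of a node word, few lemmas will reach the `min_freq` threshold, which is one way to see why a small corpus yields so few collocates.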