Corpus of Contemporary American English (COCA)
Google / Web as Corpus
The Web is much larger than the Corpus of Contemporary American English
(COCA), and Google is a great search engine. So why not just use Google to
see what's happening in contemporary American English? Well, as good as it
is for most searches, there are things that Google (like any other search
engine) either cannot do or does only very poorly, but which are
possible with our corpus. These include the following:
Differences between different styles or types of English. Is the
"I'm like..." construction, the string ". Well ," (Well at the start of a
sentence), or the word attitudinal used more in
informal (e.g. spoken) or formal (e.g. academic) English? Google is
pretty good at knowing what domain something comes from (e.g. cbs.com or neh.org), but it can't reliably relate that to
"genre" or "styles of speech".
Change over time. Is the word
box used more or less now than in the
early 1990s? Which verbs have really been on the increase over the last
2-3 years? There is no way to check this with Google/glowbe or other search engines.
Is the construction end up VERB-ing (e.g. ended up paying too
much) on the
increase or decrease? Is the
get passive (e.g.
get married) used more in
spoken or academic English?
Google doesn't allow you to search by part of speech or lemma (i.e.
all of the forms of a word). You'd have to search for each string
individually (e.g. all forms of end + up + every conceivable -ing verb form).
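To make the idea of a lemma and part-of-speech search concrete, here is a minimal sketch over a tiny hand-tagged sample. It is not how COCA is implemented or queried; the sample tokens, the Penn Treebank-style tags (VBG for an -ing verb form), and the function name find_end_up_verbing are all assumptions made for illustration.

```python
# Toy corpus: (word form, lemma, part-of-speech tag) triples.
# The Penn Treebank-style tags and the sample sentences are assumptions
# made for illustration; they are not taken from COCA.
TAGGED = [
    ("We", "we", "PRP"), ("ended", "end", "VBD"), ("up", "up", "RP"),
    ("paying", "pay", "VBG"), ("too", "too", "RB"), ("much", "much", "RB"),
    ("He", "he", "PRP"), ("ends", "end", "VBZ"), ("up", "up", "RP"),
    ("leaving", "leave", "VBG"), ("early", "early", "RB"),
    ("They", "they", "PRP"), ("end", "end", "VBP"), ("up", "up", "RP"),
    ("there", "there", "RB"),
]


def find_end_up_verbing(tokens):
    """Return every 3-token match of: lemma 'end' + 'up' + any -ing verb."""
    matches = []
    for i in range(len(tokens) - 2):
        (w1, lemma1, _), (w2, _, _), (w3, _, tag3) = tokens[i:i + 3]
        # Matching on the lemma covers end, ends, ended, ending in one query.
        if lemma1 == "end" and w2.lower() == "up" and tag3 == "VBG":
            matches.append(f"{w1} {w2} {w3}")
    return matches


print(find_end_up_verbing(TAGGED))
# prints ['ended up paying', 'ends up leaving']
```

A single query of this kind replaces the long list of literal strings that a plain string search would need.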
Wildcard searches. What are the most common word forms that match a given
wildcard pattern, such as one that matches Soviet-backed or one that
matches hyperspace? Wildcards are no problem with COCA, but
you can't use them with Google.
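As a rough illustration of what a wildcard search returns, the sketch below counts the word forms in a toy list that match a pattern. The patterns "*-backed" and "hyper*" are assumed here, since the original search strings are not shown. The glob syntax is that of Python's fnmatch module, not COCA's own query syntax, and the word list is invented.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Toy word list standing in for a corpus; the words and their counts are
# invented for illustration only.
WORDS = ["Soviet-backed", "hyperspace", "government-backed", "hyperactive",
         "Soviet-backed", "backed", "hyperspace", "space"]


def wildcard_frequencies(words, pattern):
    """Return (word form, count) pairs for forms matching a glob pattern,
    most frequent first."""
    return Counter(w for w in words if fnmatchcase(w, pattern)).most_common()


print(wildcard_frequencies(WORDS, "*-backed"))
# prints [('Soviet-backed', 2), ('government-backed', 1)]
print(wildcard_frequencies(WORDS, "hyper*"))
# prints [('hyperspace', 2), ('hyperactive', 1)]
```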
Collocate searches. How are particular words
used in the language? In order to find out, you need to look at collocates
(nearby words), since (as corpus linguists are fond of saying) "the
words that a word 'hangs out with' can tell you a lot about its
meaning". But Google doesn't do collocates.
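For readers unfamiliar with collocates, the following is a minimal sketch of how they can be counted: for every occurrence of a node word, tally the words within a fixed window on either side. The window size, the function name collocates, and the sample text are assumptions for illustration. A real corpus tool would typically also rank collocates with a statistic such as mutual information rather than raw counts.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens of each occurrence of `node`."""
    lowered = [t.lower() for t in tokens]
    node = node.lower()
    counts = Counter()
    for i, tok in enumerate(lowered):
        if tok != node:
            continue
        # Words to the left and to the right of the node, excluding the node itself.
        window = lowered[max(0, i - span):i] + lowered[i + 1:i + 1 + span]
        counts.update(window)
    return counts


# Invented sample text, for illustration only.
sample = "hard work and more hard work".split()
print(collocates(sample, "hard", span=1).most_common())
# prints [('work', 2), ('more', 1)]
```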
Semantically-based searches. Since Google can't do collocates,
it obviously can't use them to compare word meanings in different genres
(e.g. chair in fiction and academic), or to see how they're
changing over time (e.g.
green = "environmentally friendly").
And even more
complex semantically-based searches. Google only really knows
how to search for specific words and strings. It doesn't let you
search by words that are related in meaning, such
as all of the synonyms
of a given word, or all of the words in
personalized lists you've created (related to fashion, or food, or clothing, or
whatever) as part of a query. Our corpus can do both of these.
Finding the word
when you don't know what the word is. What are the
words that are found mainly in medical articles, the
collocates of hard
that are used more in fiction or newspapers, or the
synonyms of strong that are
found mainly in fiction or academic? Google allows you to find the occurrences of
a given form that you already know, but it can't produce a list of
words for you that match criteria like these.
Strings of words. Sure, on Google you can search for a phrase
like "might be taken for a".
Go ahead and try it. How many
hits does it say there are? Our search today shows 955,000. Start
paging through the hits, though, and they run out at about 450 (i.e.
about 44 pages of 10 links each, and then they end). In
other words, Google's "guess" is more than 2,000 times what it should be.
This is because Google usually doesn't "know" the frequency of
anything more than single words -- it's usually just guessing.
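The size of that gap can be checked from the two figures quoted above. The function name overestimate_ratio is just an illustrative label.

```python
def overestimate_ratio(reported_hits, retrievable_hits):
    """How many times larger the reported hit count is than the hits you can page through."""
    return reported_hits / retrievable_hits


# 955,000 reported hits versus about 450 retrievable results (figures from the text).
print(round(overestimate_ratio(955_000, 450)))
# prints 2122
```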
So if you want to find
web pages dealing with a certain topic, then Google is fine. But
using Google as a full-blown linguistic search engine has real
drawbacks. None of the preceding types of searches -- which are some
of the most interesting ones that you can carry out to see what's
going on with the language -- are possible with Google (or any other
search engine). But they are all possible -- quickly and easily -- with the Corpus of
Contemporary American English.