INSIGHT INTO VARIATION
The corpora from www.english-corpora.org allow research on variation -- historical, between dialects, and between genres -- in ways that are not possible with other corpora. This is due to at least three factors:
1. CORPORA: texts from a wide range of genres, dialects, and time periods -- not just a huge "blob" of billions of words of easily obtainable newspapers or web pages. With such a blob, you might have information on a linguistic feature in just one genre, in one country, at one time period, and miss much of the richness and variety of language.
2. SIZE: the corpora are 100-200 times as large as (otherwise) similar corpora, and so they potentially yield many more tokens (and yet they are still very fast).
3. QUERIES: our proprietary corpus architecture and interface are designed "from the ground up" to allow comparisons of different portions of the corpus (time periods, dialects, and genres).
HISTORICAL VARIATION (1820s-2010s)
COHA: 475 million words, 1820s-2010s. 100-200 times as large as any other structured historical corpus of English.
- Lexical: the frequency of any word or phrase, e.g. bestow, swell (ADJ), guys, of no little, as though to, freak out
- Lexical: compare all words in different time periods, e.g. *ism words (compare earlier/later), *heart* words (earlier/later)
- Phraseology: so ADJ as to V, BE but, HAVE quite V-ed, a most ADJ NOUN
- Syntax/grammar: e.g. end up V-ing, post-verbal negation with need, need to VERB, sentence initial hopefully, get passive
- Semantics/meaning: use collocates to see change over time, e.g. gay (compare earlier/later), chip, engine, web
- Discourse/culture: use collocates to see what we're saying about topics over time: women (compare earlier/later), religion (earlier/later)
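For readers unfamiliar with collocates, the idea behind the "Semantics/meaning" comparisons can be sketched in a few lines of Python. This is only an illustration, not the site's implementation: the function name `collocates`, the window size `span`, and the two toy sentences are all assumptions made up for this example.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens of each occurrence of `node`.

    `span` is an assumed window size; real corpus tools let you choose it.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left_start = max(0, i - span)
        # Words to the left and right of the node word, excluding the node itself.
        window = tokens[left_start:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts


# Toy "earlier" and "later" texts (invented sentences, not corpus data).
EARLIER_TEXT = "the web of the spider hung in the barn".split()
LATER_TEXT = "she searched the web for a new web browser".split()

if __name__ == "__main__":
    print("earlier:", collocates(EARLIER_TEXT, "web", span=2).most_common())
    print("later:  ", collocates(LATER_TEXT, "web", span=2).most_common())
```

Comparing the two collocate lists side by side is what suggests a change in meaning; with real data the lists come from millions of words per period rather than one sentence each.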
|
HISTORICAL VARIATION (recent: 1990-2019)
COCA: 1 billion words, 1990-2019. The only large corpus that keeps the same genre balance year to year.
- Lexical: the frequency of any word or phrase, e.g. morph, old-school, FREAK out, (think) outside the box, throw someone under the bus, BE likely a|the
- Lexical: compare all words in different time periods, e.g. increases from 1990-94 (left) to 2010-2019 (right): *ism words, *gate words (potentially "scandal"), *friendly words (note increase), and phrasal verbs with up. Note that not every entry is relevant, but it's a good starting point.
- Syntax/grammar: e.g. END up V-ing, GET passive (got hired), "quotative like" (he's like, I'm not going), so not ADJ (I'm so not interested in her)
- Semantics/meaning: use collocates to see change over time, e.g. green, web, engine
- Discourse/culture: changes in frequency: blacks, retarded; use collocates to see what we're saying about topics over time: crisis, terror, gay
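The "compare all words in different time periods" searches (e.g. *ism words, earlier vs. later) amount to filtering words by a wildcard and ranking them by how much their frequency changed. A minimal sketch follows, assuming the wildcard behaves like a simple shell-style glob and that frequencies are already expressed per million words. The function name `compare_periods` and all numbers are hypothetical, invented for illustration only.

```python
from fnmatch import fnmatchcase

# Hypothetical per-million frequencies for two periods (made-up numbers).
EARLIER = {"heroism": 20.0, "terrorism": 1.0, "heart": 150.0}
LATER = {"heroism": 2.5, "terrorism": 20.0, "racism": 12.5, "heart": 140.0}


def compare_periods(pattern, earlier, later):
    """Return (word, earlier, later, change) rows for words matching the
    wildcard `pattern`, sorted with the largest increase first."""
    words = [w for w in set(earlier) | set(later) if fnmatchcase(w, pattern)]
    rows = []
    for word in words:
        before = earlier.get(word, 0.0)  # a word absent from a period counts as 0
        after = later.get(word, 0.0)
        rows.append((word, before, after, after - before))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for row in compare_periods("*ism", EARLIER, LATER):
        print(row)
```

Sorting by the raw difference is just one choice; a ratio of the two frequencies would rank rare but fast-growing words higher. As the page notes for the real searches, not every match will be relevant.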
|
HISTORICAL VARIATION (Google Books)
Google Books (Advanced): 155 billion words, 1810s-2000s. Much more advanced interface/searches than the standard Google Books n-grams.
- Lexical: the frequency of any word or phrase, e.g. BESTOW, a swell NOUN (chart), guys, of no little, as though to, FREAK out
- Lexical: compare all words in different time periods, e.g. *ism words (compare earlier/later), *heart* words (earlier/later)
- Phraseology: so ADJ as to VERB (table), [be] but a NOUN (table), HAVE quite V-ed, a most ADJ NOUN (table)
- Syntax/grammar: e.g. [end] up VERB-ing (chart | table), VERB someone into VERB-ing (chart | table), VERB one's way PREP (e.g. force his way into), and who / whom + did + PRON (e.g. who/whom did you (VERB); see chart showing increase in who). Also, must VERB, should VERB, ought to VERB, has to VERB, or need to VERB.
- Semantics/meaning: synonyms: "beautiful" woman, "clever" person; collocates show change in meaning, e.g. gay (compare earlier/later)
- Discourse/culture: changes in frequency: negro, colored person, blacks, deaf and dumb, retarded, handicapped; use collocates to see what we're saying about topics over time (1800s vs 1970s-2000s): fast, art, women, music, food
|
VARIATION BETWEEN DIALECTS: compare 20 dialects of World English
GloWbE: 1.9 billion words, 20 different countries. 100 times as large as the next-largest corpus of English dialects.
- Lexical: the frequency of any word or phrase, e.g. fortnight, on holiday, banjax*, bikkies, thrice, eve teas*, ACT the maggot, lah!, ackee
- Lexical: compare all words in different dialects, e.g. *ism words by dialect ("core" vs. South Asia), *ies nouns in Australian
- Phraseology: e.g. BE different to, rather more ADJ, take ADJ food, in over ~ head, USE ~ head, MAKE ~ head spin
- Syntax/grammar: VERB likely VERB (e.g. would likely remember), like construction, way construction, try and VERB, go + ADJ, STOP someone V-ing
- Semantics/meaning: use collocates to see differences between dialects, e.g. scheme (US/CA = negative), cupboards (US/CA = mainly kitchen)
- Discourse/culture: frequency of words, e.g. Quran, Buddh*, feminism. With collocates, e.g. ADJ belief (South Asia vs "core"), ADJ wife (+/- "core")
|
VARIATION BETWEEN GENRES: American (COCA)
COCA: 1 billion words, 1990-2019. The largest genre-balanced corpus that is currently freely available.
- Lexical: the frequency of any word or phrase, e.g. (spoken) I guess, ", you know ,"; (fiction) muffled, frowned; (academic) validity, correlate
- Lexical: compare all words in different genres (give these 10-15 seconds each to run), e.g. verbs (past tense) in fiction, ADJ in academic, verbs in religion magazines, adjectives in medical academic
- Phraseology: e.g. ". In particular ,", a lot of, kind of NOUN, type of NOUN; phrasal verbs with out (FIC/ACAD)
- Syntax/grammar: (spoken) "and I'm like ,", get passive, end up V-ing; (fiction) had been V-ing; (academic) be passive, appear to VERB, must + VERB
- Semantics/meaning: use collocates to see differences between genres, e.g. FIC (left) vs ACAD (right): chair, chain, string; synonyms of strong, weak
- Discourse/culture: frequency of words and phrases, e.g. global warming, climate change, crippled, people|person of color
|
VARIATION BETWEEN GENRES: British (BNC)
BNC: 100 million words, 1980s-1993. Note: counts are somewhat lower than in COCA, since the BNC is a much smaller corpus.
- Lexical: the frequency of any word or phrase, e.g. (spoken) I reckon, ", you know ,"; (fiction) muffled, frowned; (academic) validity, correlate
- Lexical: compare all words in different genres, e.g. verbs (past tense) in fiction, ADJ in academic, verbs in sermons, ADJ in tabloid news
- Phraseology: e.g. ". In particular ,", a lot of, kind of NOUN, type of NOUN; phrasal verbs with out (FIC/ACAD)
- Syntax/grammar: (spoken) get passive, BE V-ing; (fiction) had been V-ing; (academic) be passive, appear to V, HAVE to VERB, whom
- Semantics/meaning: use collocates to see differences between genres, e.g. FIC (left) vs ACAD (right): chair, chain, string; synonyms of strong, weak
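Because the BNC is much smaller than COCA, raw counts from the two corpora are not directly comparable; the usual workaround is to convert them to a rate per million words. A minimal sketch, assuming the round corpus sizes quoted on this page; the raw counts in the usage lines are hypothetical and are not real corpus results.

```python
# Approximate corpus sizes in words, as quoted on this page (rounded).
CORPUS_SIZES = {"COCA": 1_000_000_000, "BNC": 100_000_000}


def per_million(raw_count, corpus):
    """Convert a raw hit count into occurrences per million words."""
    return raw_count * 1_000_000 / CORPUS_SIZES[corpus]


if __name__ == "__main__":
    # Hypothetical raw counts for the same phrase in each corpus.
    print("COCA:", per_million(5000, "COCA"), "per million")
    print("BNC: ", per_million(800, "BNC"), "per million")
```

With these made-up numbers the phrase has fewer raw hits in the BNC but a higher rate per million words, which is the kind of reversal that normalization is meant to expose.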
|
|