PDF with images: the best short introduction to the corpus
 


 

The following is a short tour of the COCA corpus, including new features in March 2020. You can click on any of the links below to carry out sample searches, and then return to this page (for more searches) by clicking on TOUR at the top of the page.

 

TEXTS The COCA corpus contains about 1 billion words in nearly 500,000 texts from 1990 to 2019 -- which are nearly evenly divided between spoken, fiction, magazines, newspapers, academic journals, blogs, other web pages, and TV/Movie subtitles (120-130 million words in each genre). In addition, there are 20 million words each year from 1990-2019 (with the same genre balance each year), which means that you have very recent data -- not just stale data from 25-30 years ago. This is the only corpus of English that 1) is large, 2) is recent, and 3) has a wide range of genres.

 

SEARCHES As with the other BYU corpora, you can do many kinds of searches, such as single words (mysterious, skew), phrases (make up, on the other hand), "fuzzy phrases" (more * than, * bit), wildcards (*icity, *break*), lemma / forms (SKIM, BUY * NOUN), alternants and NOT (fast|slow rate, pretty -NOUN), part of speech (rough NOUN, VERB money), synonyms (=beautiful, =gorgeous NOUN, =clever =man), customized word lists (@clothes, @colors @clothes), and any combination of these (GET him|her to VERB, PUT on POSS @clothes). Click on the box next to SECTIONS in the search form to see the results from each genre and time period (e.g. rough NOUN or =gorgeous NOUN). You can also compare words, such as nouns that appear with small and little, adjectives with men and women, or nouns with utter and sheer.

 

Note that because the corpus has an architecture that is designed for speed (unlike other large 10-20 billion word corpora), even searches like the following only take a few seconds: got VERB-ed, I VERB PRON VERB, phrasal verbs, or NOUN NOUN.

GENRES Because the corpus has so much data from each of the eight genres listed above, you can do powerful searches to compare among these genres. This lets you know, for example, whether a particular word or phrase or grammatical construction is more informal or formal. For example, look for the frequency of I guess, muffled, validity, or forms of need + to + VERB. Via the chart display, you can also see the frequency of the word or phrase in subregisters as well, such as movie scripts, children's fiction, women's magazines, or medical journals.

You can also limit searches to particular genres, such as *ize verbs in ACADEMIC; past tense verb + up in TV/MOVIES; nouns near chair in FIC; or adjectives in ACADEMIC-Medicine (click on SECTIONS in the search form to see the selected genres, if they're not already visible). Better yet, you can easily and quickly compare all of the results in one genre (or sub-genre) with another. For example: past tense verb + up in TV/M vs ACAD; hard + NOUN in MAG vs ACAD; nouns in NEWS-Money vs NEWS; adjectives in ACAD-Medicine vs ACAD; synonyms of strong in TV/M vs ACAD; nouns near chair in ACAD vs FIC; or ADJ + WOMAN in FIC vs ACAD.
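To make the idea of a genre comparison concrete, here is a minimal sketch of what such a comparison computes: the frequency of a word in each genre, normalized per million words so that genres of different sizes can be compared. This is only an illustration, not how the corpus itself is implemented. The tiny TOY_CORPUS, the genre labels used as dictionary keys, and the function names per_million and compare are all assumptions made up for this example; the sentences are invented and are not COCA data.

```python
from collections import Counter

# Toy stand-in for a genre-tagged corpus. These sentences are invented
# for illustration only and are NOT taken from COCA.
TOY_CORPUS = {
    "SPOK": "i guess we need to go now i guess so".split(),
    "ACAD": "the validity of the results depends on the validity of the sample".split(),
}


def per_million(word, tokens):
    """Occurrences of `word` per million tokens in one genre."""
    return Counter(tokens)[word] * 1_000_000 / len(tokens)


def compare(word, genre_a, genre_b, corpus=TOY_CORPUS):
    """Return the per-million rate of `word` in two genres, side by side."""
    return per_million(word, corpus[genre_a]), per_million(word, corpus[genre_b])


if __name__ == "__main__":
    for w in ("guess", "validity"):
        spok, acad = compare(w, "SPOK", "ACAD")
        print(f"{w}: SPOK={spok:.0f} per million, ACAD={acad:.0f} per million")
```

Normalizing by genre size matters because a raw count of 2 means something different in a genre of 10 words than in one of 12; the per-million rate is what makes "more informal" or "more formal" comparisons meaningful.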

CHANGES OVER TIME The COCA corpus is the only large corpus of English that contains data (20 million words of data, with the same genre balance) for each year from 1990-2019. This allows you to see the frequency of any word or phrase over time, such as gift (as a verb), awesome, or BE likely a|the. You can also compare all words in different periods, such as -ed verbs, the suffix -friendly, or words ending in -ism. You can also see the frequency of each matching form in each five-year period, such as -ism words since the early 1990s. You can also look at syntactic constructions over time, such as end up V-ing, get passive (got hired; an increase in each five-year period), so not ADJ (I'm so not interested in her), and "quotative like" (he's like, I'm not going). You can compare collocates over time to see changes in meaning (green, web, engine), and you can see the frequency of collocates in each five-year period since the early 1990s (nuclear, crisis) to examine changes in American culture and society during this time (again, just click on the box by SECTIONS to see the frequency by section).

BROWSE / WORDS Finally -- but perhaps most importantly -- you can browse through the top 60,000 words in the corpus and see detailed information on any of these words. You can search by word form (e.g. *ism, un*able, *break*, fl?p), rank order (e.g. words near the 8,000th, 28,000th or 48,000th most frequent word), pronunciation (e.g. words rhyming with light, where you will note the many variant spellings for this sound, or three-syllable words accented on the last syllable), or any combination of these (e.g. two-syllable words that rhyme with stay where the stress is on the second syllable, mid-frequency two-syllable adjectives ending in *y, or low-frequency five-syllable *ism words). You can also search for a specific word via WORD in the search form.
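For readers unfamiliar with wildcard patterns, the following sketch shows how word-form patterns like the ones above select words from a list. It assumes that * stands for any run of characters (including none) and that ? stands for exactly one character, which is how Python's standard fnmatch module treats them; the corpus's own matching rules may differ in detail. The word list and the function name match_forms are hypothetical and exist only for this example.

```python
from fnmatch import fnmatchcase

# A small, made-up word list standing in for the browsable word list.
WORDS = [
    "racism", "tourism", "unbreakable", "unthinkable",
    "breakfast", "flip", "flop", "flap", "fly",
]


def match_forms(pattern, words=WORDS):
    """Return the words matching a wildcard pattern.

    Assumed semantics (those of fnmatch): `*` matches any sequence of
    characters, `?` matches exactly one character.
    """
    return [w for w in words if fnmatchcase(w, pattern)]


if __name__ == "__main__":
    for pattern in ("*ism", "un*able", "*break*", "fl?p"):
        print(pattern, "->", match_forms(pattern))
```

Note that fl?p matches flip, flop, and flap but not fly, since ? must be filled by exactly one character and the word must still end in p.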

When you see a word of interest, you can get a wealth of information on that word, including frequency information, definition, translation, link to audio, images, and videos, as well as synonyms, related word forms, topics (words that co-occur anywhere in the ~500,000 texts), collocates, clusters, and concordance / KWIC lines. Click on any of the links below for examples of this "word-based" information for a few sample words.
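If concordance / KWIC ("keyword in context") lines are new to you, the sketch below shows the basic idea: each occurrence of the word is shown with a few words of context on either side, so that the hits can be lined up and scanned quickly. This is a simplified illustration under stated assumptions, not the site's implementation; the function name kwic, the width parameter, and the sample sentence are all invented for this example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    `width` is the number of tokens of context kept on each side.
    Matching is case-insensitive.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    # Invented sample sentence, not corpus data.
    text = "she baked fresh bread every morning and sold bread at noon"
    for left, word, right in kwic(text.split(), "bread", width=2):
        # Right-align the left context so the keyword forms a column.
        print(f"{left:>20}  [{word}]  {right}")
```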

 

As you click on any of these links, note how you can easily and quickly move from one type of search to another (via the links at the top of those pages), and how you can easily move from one word to another. You can also see a history of all of these word-centered searches, and you can add words to your "favorites" list, for later review. None of this functionality is available from any other large corpus of English.
 

Samples topic coll clus site KWIC
NOUN bread
NOUN Buddhism
VERB knit
VERB recycle
ADJ cozy
ADJ solar
ADV electronically