

PDF with images: the best short introduction to the corpus
 


 

TEXTS: The iWeb corpus contains about 14 billion words in 22,388,141 web pages from 94,391 websites. As far as we are aware, this makes it one of only three large web-based corpora that contain more than 12-13 billion words.

 

VIRTUAL CORPORA: The nearly 95,000 websites for iWeb were chosen systematically, rather than at random as has typically been the case for other large web corpora. As a result, the websites have on average 240 web pages and 145,000 words each, which means that you can search by website (and have it mean something).
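As a quick check on these averages, the per-website figures can be recomputed from the totals quoted above. This is only back-of-the-envelope arithmetic: the word total is the rounded "about 14 billion" from the text, so the result will not match the quoted averages exactly.

```python
# Totals as quoted in the text; the word count is a rounded figure.
total_words = 14_000_000_000   # "about 14 billion words"
total_pages = 22_388_141
total_sites = 94_391

# Average size of a website in the corpus.
pages_per_site = total_pages / total_sites
words_per_site = total_words / total_sites

print(f"pages per website: {pages_per_site:.0f}")
print(f"words per website: {words_per_site:,.0f}")
```

This gives roughly 237 pages and 148,000 words per website, in line with the rounded figures of 240 and 145,000 given above.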

 

In 10-20 seconds or less, and with just 3-4 clicks, you can create "virtual corpora" for any topic, such as (NOUN) Buddhism, chocolate, fragrance, investment, basketball, climbing, grammar, gardener; (ADJ) philosophical, hydraulic, psychedelic, synaptic; (VERB) knit, drill, spray, hack, recycle; (ADJ+NOUN) financial services, essential oils, medical center, nervous system; (NOUN+NOUN) nuclear energy, weight loss, stock market, science fiction, web design (or any word or phrase).

 

And then for any virtual corpus that you create, you can search within just that virtual corpus, compare the frequency of words or phrases across different virtual corpora, or see the keywords from a virtual corpus (e.g. for Buddhism, investment, philosophical, hack, nervous system, or solar energy).
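To make the idea of "keywords from a virtual corpus" concrete, here is a minimal sketch that ranks words by how much more frequent they are in one small text collection than in another. It is an illustration only: the two toy "corpora", the function name `keywords`, and the smoothed frequency-ratio score are all assumptions made for this example, and nothing here describes how the corpus interface itself computes keywords.

```python
from collections import Counter

def keywords(target_counts, reference_counts, top_n=3, smoothing=1.0):
    """Rank words by the ratio of their relative frequency in the target
    collection to their relative frequency in the reference collection.
    Add-one style smoothing avoids division by zero for unseen words."""
    target_total = sum(target_counts.values())
    reference_total = sum(reference_counts.values())
    scores = {}
    for word, count in target_counts.items():
        target_rate = (count + smoothing) / target_total
        reference_rate = (reference_counts.get(word, 0) + smoothing) / reference_total
        scores[word] = target_rate / reference_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data standing in for two virtual corpora (hypothetical, not from iWeb).
buddhism = Counter("the monk taught meditation and the monk practiced meditation daily".split())
investment = Counter("the fund bought stock and the fund sold stock daily".split())

print(keywords(buddhism, investment, top_n=2))
print(keywords(investment, buddhism, top_n=2))
```

Words shared by both collections (such as "the" or "daily") score near 1 and drop out, while words specific to one collection rise to the top.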

 

SEARCHES: As with the other BYU corpora, you can do many kinds of searches, such as single words (mysterious, skew), phrases (make up, on the other hand), "fuzzy phrases" (more * than, * bit), wildcards (*icity, *break*), lemma / forms (SKIM, BUY * NOUN), alternates and NOT (fast|slow rate, pretty -NOUN), part of speech (rough NOUN, VERB money), synonyms (=beautiful, =gorgeous NOUN, =clever =man), customized word lists (@clothes, @colors @clothes), and any combination of these (GET him|her to VERB, PUT on POSS @clothes).

 

Note that because the corpus has an architecture that is designed for speed (unlike other large 10-20 billion word corpora), even searches like the following only take a few seconds: got VERB-ed, I VERB PRON VERB, phrasal verbs, or NOUN NOUN.

 

BROWSE / WORDS: Finally, and perhaps most importantly, you can browse through the top 60,000 words in the corpus and see detailed information on any of these words. You can search by word form (e.g. *ism, un*able, *break*, fl?p), rank order (e.g. words near the 8,000th, 28,000th or 48,000th most frequent word), pronunciation (e.g. words rhyming with strange or mission, or three syllable words accented on the last syllable), or any combination of these (e.g. two syllable words that rhyme with stay where the stress is on the second syllable, mid-frequency two syllable adjectives ending in *y, or low frequency five syllable *ism words).
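The word-form patterns above can be tried out offline with a short sketch. It assumes that * stands for any run of characters and ? for exactly one character, as in shell-style wildcards; that reading is an assumption about the corpus syntax, and the word list and the helper `match_words` are made up for illustration.

```python
from fnmatch import fnmatchcase

# Toy vocabulary for illustration only; not drawn from the corpus word list.
words = ["realism", "unbreakable", "unthinkable", "breakfast",
         "flip", "flop", "flap", "flipped"]

def match_words(pattern, vocabulary):
    """Return the words that match a shell-style wildcard pattern,
    where * matches any run of characters and ? matches one character."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]

for pattern in ["*ism", "un*able", "*break*", "fl?p"]:
    print(pattern, match_words(pattern, words))
```

Under that assumption, fl?p matches flip, flop and flap but not flipped, and *break* matches any word containing "break", wherever it occurs.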

 

As you click on any of these links, note how you can easily and quickly move from one type of search to another (via the links at the top of those pages), and how you can easily move from one word to another. You can also see a history of all of these word-centered searches, and you can add words to your "favorites" list for later review. None of this functionality is available from any other large corpus of English.

 

Samples topic coll clus site KWIC
NOUN bread
NOUN Buddhism
VERB knit
VERB recycle
ADJ cozy
ADJ solar
ADV electronically