English-Corpora.org


English-Corpora.org and Sketch Engine are probably the two largest sites for online corpora. We believe that both sites provide valuable resources for linguists, lexicographers, language learners, and teachers.

The following comparison of the two sites is for those who are already familiar with Sketch Engine but are new to English-Corpora.org. Admittedly (because this list is hosted at English-Corpora.org), it is probably biased towards English-Corpora.org, and we invite you to look in more depth at what Sketch Engine has to offer as well. Finally, if any information below is incomplete or incorrect, please let us know.

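Feature | Sketch Engine | English-Corpora.org

(Under each feature below, the first list describes Sketch Engine and the second describes English-Corpora.org. Where only one list appears, it describes English-Corpora.org.)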

Corpora

- Extremely wide (90+) range of languages, and hundreds of corpora
- For English, very large web-based corpora, as well as many other specialized corpora

- Mostly English, as well as some for Spanish and Portuguese
- For English, perhaps the best suite of corpora for looking at variation: genre-based, historical, and dialectal
- Largest corpora are iWeb (14 billion words) and NOW (14.6 billion words and growing by ~250 million words each month)

Users / research

- Linguists and lexicographers, teachers and learners, etc.
(For those with information on Sketch Engine, please send us more detailed / verifiable information on number of users, researchers, universities with licenses, number of publications, etc.)

- ~130,000 distinct users each month, including about 80,000 registered users
- ~300 universities have academic (group) licenses, as well as large government-funded licenses
- More than 16,500 registered "researchers" (professors or graduate students) in linguistics or language studies
- Cited in more than 10,000 academic publications, including more than 5,000 in the past five years
- The data (e.g. full-text, word frequency) is used by hundreds of companies, including Google, Amazon, Microsoft, IBM, Samsung; Merriam-Webster, dictionary.com, Grammarly, Oxford University Press, Educational Testing Service, Michigan Language Assessment, Sketch Engine, etc.

Pricing

- Free 30-day trial
- Individual: ~$83 (USD) for one year (non-EU)
- University: ~$7,800 USD (one year, 200 users, read only)

- Free forever: up to 50 or 200 searches per day, depending on the type of user (professor, graduate student, etc.)
- Individual: $30 for one year. As little as $20/yr for multi-year (2- or 3-year) renewals
- University: $795 (one year, unlimited users). Discounts for multi-year licenses

     

Search syntax (concordances in SE)

- Very powerful, using CQL (Corpus Query Language). Example: [lemma="buy"] [word="some"] [tag="J.*"] [tag="NN.*"]
 

- Wide range of search types: word, lemma, part of speech, synonyms, customized wordlists, etc.
- Emphasis on usability / simplicity (especially for mobile users); example: BUY * ADJ NOUN
- Even more powerful queries, e.g. BUY * =expensive @CLOTHES (buy some pricey shoes, bought an expensive dress)
- Integrated n-grams provide lightning-fast searches (e.g. across 14 billion words: some ADJ NOUN, VERB * NOUN)
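
To make the slot-by-slot idea behind both query styles concrete, here is a minimal Python sketch. It is not how either site implements search. It only illustrates the general technique of matching one set of constraints per token position against a tagged sentence. The sentence, its Penn Treebank-style tags, and the names `token_matches` and `find_matches` are all assumptions made up for this illustration. The first pattern mirrors the CQL example above, and the second uses an empty slot to play the role of the `*` wildcard in BUY * ADJ NOUN.

```python
import re

# Toy tagged sentence. The tags are hand-assigned for illustration only
# (hypothetical data, not output from Sketch Engine or English-Corpora.org).
SENTENCE = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "bought", "lemma": "buy", "tag": "VBD"},
    {"word": "some", "lemma": "some", "tag": "DT"},
    {"word": "expensive", "lemma": "expensive", "tag": "JJ"},
    {"word": "shoes", "lemma": "shoe", "tag": "NNS"},
    {"word": "yesterday", "lemma": "yesterday", "tag": "NN"},
]

# One dict per token slot: attribute name -> regular expression.
# This mirrors [lemma="buy"] [word="some"] [tag="J.*"] [tag="NN.*"].
PATTERN = [
    {"lemma": "buy"},
    {"word": "some"},
    {"tag": "J.*"},
    {"tag": "NN.*"},
]

# An empty dict places no constraints on a slot, so it acts as a wildcard.
WILDCARD_PATTERN = [
    {"lemma": "buy"},
    {},
    {"tag": "J.*"},
    {"tag": "NN.*"},
]


def token_matches(token, constraints):
    """Return True if every attribute fully matches its regular expression."""
    return all(
        re.fullmatch(regex, token[attr]) is not None
        for attr, regex in constraints.items()
    )


def find_matches(tokens, pattern):
    """Return each run of consecutive tokens that satisfies the pattern."""
    size = len(pattern)
    hits = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(token_matches(tok, slot) for tok, slot in zip(window, pattern)):
            hits.append(" ".join(tok["word"] for tok in window))
    return hits


print(find_matches(SENTENCE, PATTERN))           # ['bought some expensive shoes']
print(find_matches(SENTENCE, WILDCARD_PATTERN))  # ['bought some expensive shoes']
```

Both patterns find the same phrase here because the inflected form "bought" is matched through its lemma. Real query engines rely on indexes rather than a linear scan like this one.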

Speed

- Very fast; often just a few seconds to search billions of words of data

- About 5-20 times as fast as Sketch Engine (depending on the search). Probably the fastest available architecture for full-featured corpora.

     

Word-level information / sketches

- Powerful features, including advanced displays of collocates and n-grams

- Advanced displays for collocates, including "topics" (words co-occurring anywhere on page)
- "Home page" for each of the top 60,000 words, including definition, synonyms, frequency information (dispersion, word forms, etc.), collocates, related topics, Keyword in Context (concordance), clusters (n-grams), related word forms, and links to external resources: definitions, WordNet, pronunciation, images, video, etc.
- Can save words (and phrases) for later study and analysis, and assign to user-defined categories

Browse words

 

- Browse through and search carefully corrected lists of the top 60,000 words (lemmas) (in COCA and iWeb)
- Search by word form, frequency, definition, synonym, more specific/general words, and pronunciation
- Integrated Academic Vocabulary List (top 3,000 words in academic English)

Analyze entire texts

 

- Input entire texts (up to 5,000 words); for example, articles from online newspapers or student compositions
- Words: find keywords in texts, and click on any word for detailed "word sketch" (see above)
- Phrases: quickly and easily select phrases in text, and find similar phrases in COCA (a sort of "phraseological thesaurus")

User-defined / virtual corpora

- Extremely powerful features to create corpora from your own texts (including use of BootCaT)
- Can create "virtual" corpora from texts in any of the SE-created corpora

- No importing of user-created corpora (although see "analyze texts" above)
- Powerful "virtual" corpora: user-defined sub-corpora built from texts in any of 17 standard corpora
- Create virtual corpora based on words or phrases, or on metadata for the texts
- Can create virtual corpora with just a few clicks, in just a few seconds

     

Tied in to external resources

 

- (From word and phrase frequency page): Google (Books), translation, pronunciation, videos
- (From concordance and expanded text page): translation (one line or many lines), pronunciation, and full range of word-level resources (see above)

Downloadable data

- Wide range of (uncorrected) data (e.g. word frequency) from within the web interface

- Downloadable full-text data, including tens of billions of words of data
- Carefully corrected data: word frequency, collocates, and n-grams