English-Corpora.org and Sketch Engine are probably
the two largest sites for online corpora. We believe that both sites provide
valuable resources for linguists, lexicographers, and language learners and
teachers.
The following is a comparison of the two sites, for
those who are already familiar with Sketch Engine but are new to English-Corpora.org.
Admittedly, because this list is hosted at English-Corpora.org, it is probably biased
towards English-Corpora.org, and we invite you to look in more depth at what
Sketch Engine has to offer as well. Finally, if any information below is incomplete
or incorrect, please let us know.
Feature |
Sketch Engine |
English-Corpora.org |
Corpora |
- Extremely wide (90+) range of languages, and hundreds
of corpora
- For English, very large web-based corpora, as well as many other
specialized corpora |
- Mostly English, as well as some for
Spanish
and Portuguese
- For English, perhaps the best suite of corpora for looking at
variation: genre-based, historical, and
dialectal
- Largest corpora are
iWeb (14 billion words) and
NOW (14.6 billion
words and growing by ~250 million words each month)
|
Users / research |
- Linguists and lexicographers, teachers and learners, etc.
(For those with information on Sketch Engine,
please send us more detailed / verifiable information on number of
users, researchers, universities with licenses, number of
publications, etc) |
- ~130,000 distinct users
each month, including about 80,000 registered users
- ~300
universities have academic (group) licenses, as well as large
government-funded licenses
- More than 16,500 registered "researchers"
(professors or graduate students) in linguistics or language studies
- Cited in more than 10,000
academic publications, including more than 5,000 in the past five
years
- The data (e.g.
full-text,
word frequency) is used by hundreds of companies, including Google, Amazon,
Microsoft, IBM, Samsung; Merriam-Webster, dictionary.com, Grammarly,
Oxford University Press, Educational Testing Service, Michigan Language
Assessment, Sketch Engine, etc. |
Pricing |
- Free 30 day trial
- Individual: ~$83 (USD) for one year (non-EU)
- University: ~$7,800 USD (one year, 200 users, read only) |
- Free forever: up to 50 or 200 searches per day,
depending on whether the user is a professor, graduate student, etc.
- Individual: $30 for one year. As little as
$25/yr for multi-year (2 or 3 year) renewals
- University: $795 (one year, unlimited users). Discounts for
multi-year licenses |
Search syntax (concordances in SE) |
- Very powerful, using
CQL (Corpus Query Language). Example:
- [lemma="buy"] [word="some"] [tag="J.*"] [tag="NN.*"]
|
- Wide range of search types:
word, lemma, part of speech, synonyms, customized wordlists, etc
- Emphasis on usability / simplicity (especially for mobile users); example:
BUY * ADJ NOUN
- Even more powerful queries, e.g. BUY * =expensive @CLOTHES (buy
some pricey shoes, bought an expensive dress)
- Integrated n-grams provide lightning fast searches (e.g. 14 billion
words: some
ADJ NOUN,
VERB * NOUN) |
Speed |
- Very fast; often just a few seconds to search billions of
words of data |
- About 5-20 times as fast as Sketch
Engine (depending on the search). Probably the fastest available
architecture for full-featured corpora. |
Word-level information / sketches |
- Powerful features, including advanced displays of
collocates and n-grams |
- Advanced displays for collocates, including "topics"
(words co-occurring anywhere on page)
- "Home page" for each of the top 60,000 words, including definition,
synonyms, collocates, related topics, Keyword in
Context (concordance), clusters (n-grams), related word forms, frequency
information (dispersion, word forms, etc), and links to external
resources: definitions, WordNet, pronunciation, images, video, etc.
- Can save words (and phrases) for later study and analysis, and assign
to user-defined categories |
Browse words |
|
- Browse through and search carefully-corrected lists
of the top 60,000 words (lemmas) (in COCA and iWeb)
- Search by word form, frequency, definition, synonym, more
specific/general words, and pronunciation
- Integrated Academic Vocabulary List (top 3,000 words in academic
English) |
Analyze entire texts |
|
- Input entire texts (up to 5,000 words); for example,
articles from online newspapers or student compositions
- Words: find keywords in texts, and click on any word for detailed "word
sketch" (see above)
- Phrases: quickly and easily select phrases in text, and find similar phrases in
COCA (a sort of "phraseological thesaurus") |
User-defined / virtual corpora |
- Extremely powerful features to create corpora from
your own texts (including use of BootCaT)
- Can create "virtual" corpora from texts in any of the SE-created
corpora |
- No importing of user-created corpora (although see
"analyze texts" above)
- Powerful "virtual" corpora
(quick overview), based on user-defined corpora from 17
standard corpora
- Create virtual corpora based on words or phrases, or metadata for the
texts
- Can create virtual corpora with just a few clicks, in just a few
seconds |
Tied in to external resources |
|
- (From word and phrase frequency page): Google
(Books), translation, pronunciation, videos
- (From concordance and expanded text page): translation (one line or
many lines), pronunciation, and full range of word-level resources (see
above) |
Downloadable data |
- Wide range of
(uncorrected) data (e.g. word frequency) from within
the web interface |
- Downloadable
full-text data, including tens of
billions of words of data
- Carefully corrected data:
word frequency,
collocates, and
n-grams |
|
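
For readers new to token-level queries, the CQL example from the search syntax row above ([lemma="buy"] [word="some"] [tag="J.*"] [tag="NN.*"]) can be read as four consecutive token constraints, each a regular expression over one attribute of a token. The following is a minimal sketch of that idea only. It is not how Sketch Engine or English-Corpora.org implement search, and the `Token` type, the `find_matches` helper, the hand-tagged sentence, and the Penn Treebank-style tags are all assumptions made for illustration.

```python
import re
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """One corpus position (hypothetical structure for this sketch)."""
    word: str   # surface form, e.g. "bought"
    lemma: str  # dictionary form, e.g. "buy"
    tag: str    # part-of-speech tag, e.g. "VBD"


# One query position: attribute name -> regex that must match the whole value.
Constraint = Dict[str, str]


def matches(token: Token, constraint: Constraint) -> bool:
    """True if every attribute regex in the constraint fully matches the token."""
    return all(
        re.fullmatch(pattern, getattr(token, attr)) is not None
        for attr, pattern in constraint.items()
    )


def find_matches(tokens: List[Token], query: List[Constraint]) -> List[List[Token]]:
    """Slide the query over the tokens and collect every window that matches."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(matches(tok, con) for tok, con in zip(window, query)):
            hits.append(window)
    return hits


# The CQL example from the table, written as a list of constraints.
QUERY = [{"lemma": "buy"}, {"word": "some"}, {"tag": "J.*"}, {"tag": "NN.*"}]

# A hand-tagged sentence (assumed tags, for illustration only).
SENTENCE = [
    Token("She", "she", "PRP"),
    Token("bought", "buy", "VBD"),
    Token("some", "some", "DT"),
    Token("pricey", "pricey", "JJ"),
    Token("shoes", "shoe", "NNS"),
]

hits = find_matches(SENTENCE, QUERY)
for hit in hits:
    print(" ".join(tok.word for tok in hit))  # prints "bought some pricey shoes"
```

Matching on the lemma is what lets a single query position cover buy, buys, bought, and buying. A linear scan like this one is only practical for small inputs; corpora with billions of words rely on indexes built in advance.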