English-Corpora.org

NEW: SPANISH AND PORTUGUESE CORPORA (COMPARE TO OTHER CORPORA)

In 2015 we received a large two-year grant (see page 37) from the US National Endowment for the Humanities (NEH) to improve and expand our Spanish and Portuguese corpora. The goal is to bring them up to the level of our English corpora, which are used by more than 130,000 researchers, teachers, and students every month (thus making them the most widely-used corpora of English).

This project will greatly expand the size and scope of the Corpus del Español and the Corpus do Português. Both of these corpora were previously funded by the NEH (2001-02 and 2004-06, respectively), but after 10-15 years, they were beginning to "show their age" a bit.

Features

Size. For the modern period of both languages, the new Spanish corpus (two billion words) will be 100 times as large as before, and the new Portuguese corpus (one billion words) will be 50 times as large as before. So where a researcher might have just 20 tokens of a word or construction with the older corpora, they might now have 1,000 or even 2,000 tokens with the new corpora, which will lead to much more insightful studies.
More recent. The most recent texts from the original Corpus del Español and the Corpus do Português were from 1999, but the texts for the new corpora will all be very recent -- from just the last 3-4 years.
Comparing dialects. Unlike the older Corpus del Español and the Corpus do Português, the new corpora will allow researchers to compare frequency and usage across different Spanish- and Portuguese-speaking countries, just as is already possible for English with the GloWbE corpus.
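
The size figures above can be checked with a little arithmetic. The following is only a rough sketch, assuming that a word or construction keeps about the same relative frequency as the corpus grows (real counts will vary with the mix of texts). The function names are hypothetical and are not part of any corpus interface.

```python
# Rough sketch: how token counts scale when a corpus grows by a fixed factor.
# Assumption (not from the project itself): relative frequency stays constant.

def scaled_tokens(old_tokens: int, growth_factor: int) -> int:
    """Expected tokens in the new corpus, given the count in the old one."""
    return old_tokens * growth_factor


def implied_old_size(new_size_words: int, growth_factor: int) -> int:
    """Size of the older corpus implied by the new size and growth factor."""
    return new_size_words // growth_factor


# Figures taken from the "Size" item above (modern period only).
NEW_CORPORA = {
    "Spanish": {"new_size": 2_000_000_000, "factor": 100},
    "Portuguese": {"new_size": 1_000_000_000, "factor": 50},
}

for language, info in NEW_CORPORA.items():
    old_size = implied_old_size(info["new_size"], info["factor"])
    tokens = scaled_tokens(20, info["factor"])
    print(f"{language}: older corpus about {old_size:,} words; "
          f"20 tokens before, about {tokens:,} tokens now")
```

Under that assumption, 20 tokens in the older corpora correspond to roughly 1,000 tokens in the new Portuguese corpus and 2,000 in the new Spanish corpus, which matches the range given above.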

Related resources

This project will also develop corpus-based resources that are oriented more towards teachers and learners, similar to what we have already done with WordAndPhrase for English.
Users will be able to browse through a list of the top 40,000 words of Spanish or Portuguese and see definitions and synonyms, examine variation by country and genre (e.g. spoken, fiction, newspaper, and academic), find collocates (nearby words; to see meaning and usage) and see concordance lines (to study patterns in which the word occurs).
They will also be able to input and analyze entire texts, based on the data from the Spanish and Portuguese corpora.
The new, greatly expanded corpora will be used to create other resources, such as word frequency, n-grams, collocates, and full-text data that is similar to what is already available for English.
Depending on how the preceding projects go, we may also create corpora of Spanish and Portuguese Wikipedia, similar to what we've already done for English Wikipedia. This will also allow users to quickly and easily create "virtual corpora" on any topic that might interest them, and then extract the keywords for that particular topic.

Tentative timeline:

2015. May-Oct	Collect the texts for the two corpora
2015. Nov - 2016. May	Edit the texts and check annotation (part of speech tagging and lemmatization)
2016. Aug	Two new corpora online (format and size similar to the GloWbE corpus for English)
2016. Sep - 2017. Mar	Work on the WordAndPhrase-like resources
2017. Aug	WordAndPhrase-like resources online
(2017. Sep-Dec)	(Possibly: develop corpora for Spanish and Portuguese that will be added to every day, similar to the English NOW corpus)