In 2015 we received a large
two-year grant (see page 37) from the US National Endowment
for the Humanities (NEH) to improve and expand our Spanish and Portuguese corpora. The
goal is to bring them up to the level of our
English corpora, which are used by more than 130,000 researchers, teachers, and
students every month (thus making them the most widely-used corpora of English).
This project will greatly expand the size and scope of the
Corpus del Español and the
Corpus do Português. Both of
these corpora were previously funded by the NEH (2001-02 and 2004-06,
respectively), but after 10-15 years, they were beginning to "show their age" a
bit.
Features
-
Size. For the modern period of both
languages, the new Spanish corpus (two billion words) will be 100
times as large as before, and the new Portuguese corpus (one billion
words) will be 50 times as large as before. So where a researcher might
have just 20 tokens of a word or construction with the older corpora, they might now
have 1,000 or even 2,000 tokens with the new corpora, which will lead to much more insightful studies.
-
More recent. The most recent texts from
the original Corpus del Español and the Corpus do Português were from 1999,
but the texts for the new corpora will all be very recent -- from just the last 3-4
years.
-
Comparing dialects. Unlike the older
Corpus del Español and Corpus do Português, the new corpora will allow researchers to compare
frequency and usage across different Spanish- and Portuguese-speaking
countries, much as is
already possible for English
with the GloWbE corpus.
Related resources
-
This project will also develop corpus-based resources that are
oriented more towards teachers and learners, similar to what we have
already done with WordAndPhrase for
English.
-
Users will be able to browse through a list of the
top
40,000 words of Spanish or Portuguese and see definitions and synonyms, examine
variation by country and genre (e.g. spoken, fiction, newspaper, and academic),
find collocates (nearby words; to see meaning and usage) and see concordance
lines (to study patterns in which the word occurs).
-
They will also be able to input and analyze entire texts, based on the data from the Spanish and Portuguese
corpora.
-
The new, greatly expanded corpora will be used
to create other resources, such as
word frequency lists,
n-grams,
collocates, and
full-text data, similar to
what is already available for English.
-
Depending on how the preceding projects go, we
may also create corpora of Spanish and Portuguese Wikipedia, similar
to what we've already done for English
Wikipedia. This will also allow users to quickly and easily create "virtual
corpora" on any topic that might interest them, and then extract the
keywords for that particular topic.
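As a rough illustration of two of the terms used above (collocates and concordance lines), here is a toy sketch in Python. It is not the code behind our corpora: the tokenizer, the function names, the window sizes, and the sample sentences are all assumptions made up for this example.

```python
from collections import Counter

PUNCT = '.,;:!?"()'

def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): split on whitespace,
    # strip surrounding punctuation, lowercase.
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def collocates(tokens, node, span=2):
    """Count the words found within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def concordance(tokens, node, span=3):
    """Return keyword-in-context lines: left context, [node], right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text, not drawn from the corpora.
sample = "El café negro es fuerte. Me gusta el café con leche. Toma café negro."
tokens = tokenize(sample)

for word, freq in collocates(tokens, "café").most_common(3):
    print(word, freq)

for line in concordance(tokens, "café"):
    print(line)
```

A real corpus interface would work on lemmatized, part-of-speech-tagged text and would rank collocates with an association measure rather than raw counts. The sketch only shows what "nearby words" and "patterns in which the word occurs" mean in practice.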
Tentative timeline:
2015. May-Oct | Collect the texts for the two corpora |
2015. Nov - 2016. May | Edit the texts and check annotation (part of speech tagging and lemmatization) |
2016. Aug | Two new corpora online (format and size similar to the GloWbE corpus for English) |
2016. Sep - 2017. Mar | Work on the WordAndPhrase-like resources |
2017. Aug | WordAndPhrase-like resources online |
(2017. Sep-Dec) | (Possibly: develop corpora for Spanish and Portuguese that will be added to every day, similar to the English NOW corpus) |