Why create new corpora of
Spanish and Portuguese (two billion words for Spanish and one billion
words for Portuguese), when larger corpora already exist? For example,
Sketch Engine has a 3.9 billion word
corpus of Portuguese and a 9.6 billion word corpus of Spanish, making them 4-5
times as large as our proposed corpora. Likewise,
Corpora from the Web (COW) has a
Spanish corpus that
is almost twice as large as what we are proposing.
The reason is that size is not everything.
Once the corpus is created, it is annotated for part of speech and lemma (e.g.
dice, dijo, and diremos are all forms of the lemma decir).
While it's easy to create a large corpus from the web for any language nowadays,
it's much harder to annotate it correctly and accurately. And without good
annotation, the corpus is almost unusable, at least for some purposes.
Correcting the corpus requires someone who actually knows Spanish or
Portuguese. Based on the accuracy of the Sketch Engine and COW corpora, it
appears that nobody did. They simply blindly ran the tagger on the corpora and
then placed them online, with little or no attempt to fix things. Quick, but not
To see what types of problems have
resulted from the inaccurate tagging and lemmatization, take a look at
the following spreadsheets, which show words starting with s- in the
Sketch Engine corpus. (Since COW uses the same tagger and since it hasn't been
corrected either, its output would be essentially the same. Search for some of
the "lemmas" in these lists in COW, and you'll see that the same problems are
there as well.) The spreadsheets group words by lemma and part of speech (noun,
verb, adjective, adverb), and they show all lemmas that occur 20 times or more in
the corpus. Potential "problem" words are highlighted in yellow.
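As a rough illustration of how such a list can be built, the sketch below counts tokens per (lemma, part of speech) pair and keeps only the pairs that reach a frequency threshold. It is not the procedure used to produce the spreadsheets: the `tagged_tokens` sample, the `lemma_frequencies` function, and the `min_count` parameter are all hypothetical names introduced here, and the threshold is lowered from 20 to 2 so the tiny sample returns something.

```python
from collections import Counter

# Hypothetical tagger output: (word form, lemma, part of speech) triples.
tagged_tokens = [
    ("dice", "decir", "verb"),
    ("dijo", "decir", "verb"),
    ("diremos", "decir", "verb"),
    ("sómos", "sómos", "verb"),  # mis-lemmatized form taken from the list in this article
    ("casa", "casa", "noun"),
]


def lemma_frequencies(tokens, min_count=20):
    """Count tokens per (lemma, POS) pair; keep pairs seen at least min_count times."""
    counts = Counter((lemma, pos) for _form, lemma, pos in tokens)
    kept = [(key, n) for key, n in counts.items() if n >= min_count]
    # Most frequent first; ties are broken alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# A threshold of 2 is used only because the sample is tiny.
rows = lemma_frequencies(tagged_tokens, min_count=2)
for (lemma, pos), n in rows:
    print(lemma, pos, n)
```

Note that a frequency list like this is only as good as the lemmas the tagger assigns: a mis-lemmatized form such as "sómos" is counted as its own "lemma" instead of being folded into the correct one.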
You will notice that the lists start out well. For
example, for Spanish the top ten verbs are ser, saber, seguir, salir, señalar,
sentir, servir, solicitar, suponer, sacar -- all verbs. So far, so good. But
down around word #1000, we find the following lemmas -- one after another:
satifacer, siempore, sako, simone, sómos, seguió, sperar, substituído, supply,
safó, sardinada, subiamos, subway, sobrescribe, soñabamos, secion, subredondear,
santalucía, scripta, scuba, selecionada, sostenian, surfea, sarpado, satisfacion,
sorpendido, suguiere, semibatir, september, seva. Virtually none of these
"verbs" are really lemmas. Either they are forms (or near forms) of lemmas
-- but not the actual lemmas (somos, soñabamos, sugiere, substituído,
subiamos, sostenian), or they are from another language (supply, subway,
scuba, september), or they are just "weird" (simone, santalucía, seva).
And this is near the top of the list, where someone could have presumably
corrected the first 1000 verbs or so -- had they known either Spanish or
Portuguese. Things get much stranger further down the list, e.g. around
verb #3200: salienron, salomé, sangree, scarce, scrooge, sdfr, sebita,
seeeeeeeeeee, separació, serásn, sexan, shay, shúper, silicone, simos, siome,
ske, sommer, sorcerer, spaña, swear, self-care. None of these are verb
lemmas, and none of them have been corrected in any way.
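Some of these errors can be caught mechanically, since Spanish infinitives end in -ar, -er, or -ir (optionally followed by reflexive -se). The sketch below is a simple heuristic of our own for illustration, not a method used by any of the corpora discussed here; the names `INFINITIVE` and `flag_suspect_verb_lemmas` are hypothetical.

```python
import re

# Spanish infinitives end in -ar, -er, or -ir (-ír in verbs such as "reír"),
# optionally followed by the reflexive clitic -se.
INFINITIVE = re.compile(r"[a-záéíóúüñ]*(?:ar|er|ir|ír)(?:se)?")


def flag_suspect_verb_lemmas(lemmas):
    """Return the lemmas that do not have the shape of a Spanish infinitive."""
    return [lemma for lemma in lemmas if not INFINITIVE.fullmatch(lemma.lower())]


# A few "verb lemmas" taken from the lists above, plus two genuine ones.
sample = ["ser", "saber", "sómos", "seguió", "supply", "subway", "sperar", "september"]
suspects = flag_suspect_verb_lemmas(sample)
print(suspects)
```

On this sample the check flags sómos, seguió, supply, and subway, but it lets sperar and september through because they happen to end in -ar and -er. A filter like this can narrow down the list, but it cannot replace review by someone who knows the language.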
If you're going to create word frequency
data or language learning tools
like we've done for English, you need to carefully review thousands upon
thousands of words -- looking at their context, fixing lemmas and part of
speech, etc. And you need to have at least a rudimentary knowledge of the
language you're working with. None of this was done for these larger
Spanish and Portuguese corpora, and so they are -- as we have mentioned -- almost
unusable for many purposes.
With our Spanish and
Portuguese corpora, we will review each and every lemma (for
the top 50,000 lemmas in each corpus), to make sure that the lemma and the part
of speech are correct. It's a lot of work, and it will take several months to
complete. But once we're done, we will have the only large and usable
corpora of Spanish and Portuguese.