If you're not a (corpus) linguist, you might wonder what a "corpus" is, and how it compares to other types of resources.

A corpus (plural: corpora) is just a collection of texts that can be used for any type of analysis -- linguistic, sociological, cultural, economic, etc. The most widely used corpora of English are the corpora from English-Corpora.org, of which the Coronavirus Corpus is a part.

Corpora are similar to "textual databases" like Lexis-Nexis, but they differ in that corpora typically allow a much larger range of queries. For example, well-designed corpora allow you to do the following (see the main page of the corpus for many examples):

  • see the frequency of a word or phrase by date (or by other sections of the corpus)
  • limit searches to particular sections of the corpus, and compare words and phrases across sections (e.g. words that are more frequent in April 2020 than in March 2020)
  • find "collocates" (nearby words) of a given word, to investigate the meaning and usage of a word (following the maxim that "you can tell a lot about a word by the words that it hangs out with"), or
  • see concordance lines ("Keyword in Context", or KWIC), to see the patterns and phrases in which the word occurs
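
To make the query types above more concrete, here is a minimal sketch in Python of three of them (frequency by section, collocates, and concordance lines) run against a tiny, invented corpus. This is not how English-Corpora.org implements its searches; the texts, the function names (frequency_by_section, collocates, concordance), the simple tokenizer, and the window sizes are all assumptions made for illustration.

```python
import re
from collections import Counter

# Toy corpus of (section, text) pairs. The texts are invented for illustration
# and are not taken from the Coronavirus Corpus.
TOY_CORPUS = [
    ("2020-03", "The virus spread quickly and schools closed"),
    ("2020-03", "Officials said the virus was spreading"),
    ("2020-04", "The lockdown continued and the virus slowed"),
    ("2020-04", "People stayed home during the lockdown"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_by_section(corpus, word):
    """Count how often `word` occurs in each section (here, each month)."""
    counts = Counter()
    for section, text in corpus:
        counts[section] += tokenize(text).count(word)
    return dict(counts)


def collocates(corpus, word, span=2):
    """Count the words found within `span` tokens to either side of `word`."""
    counts = Counter()
    for _, text in corpus:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == word:
                counts.update(tokens[max(0, i - span):i])   # left context
                counts.update(tokens[i + 1:i + 1 + span])   # right context
    return counts


def concordance(corpus, word, span=3):
    """Return one "keyword in context" line per occurrence of `word`."""
    lines = []
    for section, text in corpus:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == word:
                left = tokens[max(0, i - span):i]
                right = tokens[i + 1:i + 1 + span]
                context = " ".join(left + [f"[{token}]"] + right)
                lines.append(f"{section}: {context}")
    return lines


if __name__ == "__main__":
    print(frequency_by_section(TOY_CORPUS, "lockdown"))
    print(collocates(TOY_CORPUS, "virus").most_common(3))
    for line in concordance(TOY_CORPUS, "lockdown"):
        print(line)
```

Comparing the per-section counts for a word (for instance, "lockdown" in the two months of the toy data) is the simplest form of the section-to-section comparison described above. A real corpus would add things this sketch leaves out, such as normalizing counts by section size, part-of-speech tagging, and indexing so that queries over billions of words stay fast.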

When you think about it, the Web is a kind of corpus as well (being a large "collection of texts"). But it doesn't really allow many of the types of searches listed above. With Google or Bing or another search engine, you search for a word or phrase and the search engine simply returns links to web pages. A corpus allows much more than this. And for something like research on the coronavirus (COVID-19), a corpus is much more useful than a simple web search.