The Coronavirus Corpus is composed of 1,492,979,248 words of data in 1,873,327 texts from January 2020 through December 31, 2022. We have decided not to extend the corpus beyond December 2022, because in virtually every country (with one major exception) COVID is largely under control, and there are now relatively few newspaper articles about the pandemic. If you are interested in data after December 2022, feel free to use the NOW Corpus and simply limit your search to COVID-related topics.

The Coronavirus Corpus is a subset of the NOW Corpus. The new articles in the NOW Corpus (about 10,000 articles each day) come from links harvested from hourly Bing News searches, as well as from searches of 1,000+ websites for articles that have appeared in the previous 24 hours. We then download the texts; clean them up with JusText (to remove boilerplate material); tag and lemmatize them; and then integrate them into our existing relational database architecture.

As mentioned, the Coronavirus Corpus is the subset of NOW Corpus articles that deal in some way with the coronavirus. There are two methods to find these articles:

1. Articles that have at least two occurrences of the words { coronavirus, COVID*, or pandemic* }.

The following table gives the number of distinct websites, texts, and words for each of the 36 months in the corpus (January 2020 - December 2022).
MONTH | # WEBSITES | # TEXTS | # WORDS | |
20-01 | 2,310 | 12,574 | 7,340,233 | |
20-02 | 2,636 | 24,178 | 14,494,437 | |
20-03 | 5,933 | 144,482 | 99,990,266 | |
20-04 | 6,225 | 160,599 | 107,974,419 | |
20-05 | 5,248 | 131,520 | 97,812,229 | |
20-06 | 1,844 | 104,294 | 83,275,665 | |
20-07 | 2,118 | 103,306 | 78,413,612 | |
20-08 | 2,031 | 89,230 | 74,161,690 | |
20-09 | 1,964 | 69,506 | 57,611,101 | |
20-10 | 2,061 | 68,357 | 57,135,670 | |
20-11 | 2,262 | 61,952 | 49,283,628 | |
20-12 | 2,420 | 65,883 | 50,801,864 | |
21-01 | 2,428 | 74,832 | 56,971,166 | |
21-02 | 2,411 | 56,103 | 45,191,176 | |
21-03 | 3,724 | 68,563 | 58,286,852 | |
21-04 | 2,346 | 64,999 | 51,301,395 | |
21-05 | 2,233 | 62,330 | 49,119,113 | |
21-06 | 2,207 | 56,038 | 40,039,331 | |
21-07 | 2,154 | 51,792 | 38,740,273 | |
21-08 | 2,044 | 52,279 | 41,821,384 | |
21-09 | 2,085 | 47,898 | 37,416,152 | |
21-10 | 1,984 | 40,733 | 33,838,525 | |
21-11 | 1,875 | 41,516 | 33,866,628 | |
21-12 | 1,987 | 54,626 | 42,167,130 | |
22-01 | 2,107 | 77,579 | 58,414,235 | |
22-02 | 1,907 | 42,989 | 34,357,821 | |
22-03 | 1,640 | 29,870 | 25,246,360 | |
22-04 | 1,084 | 28,752 | 24,946,323 | |
22-05 | 1,113 | 26,092 | 22,250,946 | |
22-06 | 930 | 22,848 | 18,630,103 | |
22-07 | 1,056 | 24,367 | 20,619,365 | |
22-08 | 1,120 | 21,577 | 19,140,381 | |
22-09 | 915 | 16,899 | 15,515,618 | |
22-10 | 874 | 15,099 | 13,608,839 | |
22-11 | 765 | 15,697 | 14,410,411 | |
22-12 | 794 | 17,586 | 15,281,307 |
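
As an illustration of the first selection method described earlier (at least two occurrences of coronavirus, COVID*, or pandemic*), here is a minimal Python sketch. It is not the corpus's actual code. The function names are hypothetical, and it assumes that matching is case-insensitive, that `*` stands for any run of trailing word characters, and that the two occurrences may be of the same word or of different words.

```python
import re

# Assumption: case-insensitive whole-word matching, where the "*" in
# COVID* and pandemic* allows any trailing word characters.
# "coronavirus" has no wildcard in the description, so it is matched exactly.
KEYWORD_PATTERN = re.compile(
    r"\b(?:coronavirus|covid\w*|pandemic\w*)\b",
    re.IGNORECASE,
)


def count_keyword_hits(text: str) -> int:
    """Count every occurrence of the keyword patterns in the text."""
    return len(KEYWORD_PATTERN.findall(text))


def is_coronavirus_article(text: str, min_hits: int = 2) -> bool:
    """Return True if the text has at least `min_hits` keyword occurrences."""
    return count_keyword_hits(text) >= min_hits
```

For example, `is_coronavirus_article("The coronavirus pandemic continues.")` is true under these assumptions, while an article with a single mention of "covid" would not be selected by this method alone.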