CORPUS DESIGN AND OVERVIEW (number of web sites (distinct domains), web pages, and words).

Country  Code   General (may also include blogs)   Blogs (only)                Total
                Sites  Pages  Words                Sites  Pages  Words         Sites  Pages  Words
United States US 43,249 168,771 253,536,242 48,116 106,385 133,061,093 82,260 275,156 386,809,355
Canada CA 22,178 81,644 90,846,732 16,745 54,048 43,814,827 33,776 135,692 134,765,381
Great Britain GB 39,254 232,428 255,672,390 35,229 149,413 131,671,002 64,351 381,841 387,615,074
Ireland IE 12,978 75,432 80,530,794 5,512 26,715 20,410,027 15,840 102,147 101,029,231
Australia AU 19,619 81,683 104,716,366 13,516 47,561 43,390,501 28,881 129,244 148,208,169
New Zealand NZ 11,202 54,862 58,698,828 4,970 27,817 22,625,584 14,053 82,679 81,390,476
India IN 11,217 76,609 68,032,551 9,289 37,156 28,310,511 18,618 113,765 96,430,888
Sri Lanka LK 3,307 25,310 33,793,772 1,672 13,079 12,760,726 4,208 38,389 46,583,115
Pakistan PK 3,070 25,852 38,005,985 2,899 16,917 13,332,245 4,955 42,769 51,367,152
Bangladesh BD 4,415 30,813 28,700,158 2,332 14,246 10,922,869 5,712 45,059 39,658,255
Singapore SG 5,775 28,332 29,229,186 4,255 17,127 13,711,412 8,339 45,459 42,974,705
Malaysia MY 6,225 29,302 29,026,896 4,591 16,299 13,357,745 8,966 45,601 42,420,168
Philippines PH 6,169 28,391 29,758,446 5,979 17,951 13,457,087 10,224 46,342 43,250,093
Hong Kong HK 6,720 27,896 27,906,879 2,892 16,040 12,508,796 8,740 43,936 40,450,291
South Africa ZA 7,318 28,271 31,683,286 4,566 16,993 13,645,623 10,308 45,264 45,364,498
Nigeria NG 3,448 23,329 30,622,738 2,072 13,956 11,996,583 4,516 37,285 42,646,098
Ghana GH 3,161 32,189 27,644,721 1,053 15,162 11,088,160 3,616 47,351 38,768,231
Kenya KE 4,222 31,166 28,552,920 2,073 14,796 12,480,777 5,193 45,962 41,069,085
Tanzania TZ 3,829 27,533 24,883,840 1,414 13,823 10,253,840 4,575 41,356 35,169,042
Jamaica JM 3,049 30,928 28,505,416 1,049 15,820 11,124,273 3,488 46,748 39,663,666
TOTAL   220,405 1,140,741 1,300,348,146 170,224 651,304 583,923,681 340,619 1,792,045 1,885,632,973


The web pages were collected in December 2012, using the following process:

1. The list of web pages was created by running hundreds of high-frequency COCA n-grams (e.g. "and from the", "but it is") against Google to generate essentially "random" web pages (presumably there would be no AdSense entries or meaningful page rankings for phrases like "and from the").
2. We repeated this process for each of the 20 different countries (e.g. United States, New Zealand, Sri Lanka, Hong Kong), limiting the results to each country with the Google "Advanced Search" [Region] function. The question, of course, is how well Google knows which country a page comes from, if it isn't marked by a top-level domain (e.g. ZA for South Africa). As Google explains,

 "we'll rely largely on the site's (1) country domain (.ca, .de, etc.). If an international domain (.com, .org, .eu, etc.) has been used, we'll rely on several signals, including (2) IP address, (3) location information on the page, (4) links to the page, and (5) any relevant information from Google Places."

For example, for a .com address (where no country-specific domain identifies the country), Google will try to use the IP address (which shows where the computer is physically located). But even if that fails, Google could still see that 95% of the visitors to the site come from Singapore, and that 95% of the links to that page are from Singapore (and remember that Google knows both of these things), and it would then guess that the site is probably from Singapore. It isn't perfect, but it's very, very good, as is shown in the results from the dialect-oriented searches.

3. In addition to the 20 different sets of "General" Google searches (all web pages, one set per country), we also repeated the process with Google "Blog" searches (using the Advanced / Region function in both cases). The blog searches return only blogs, while the "General" searches also include some blogs.
4. We then downloaded all of the two million unique web pages using HTTrack.
5. After this, we ran all of the two million web pages through JusText to remove boilerplate material (e.g. headers, footers, sidebars). (Thanks to Michael Bean for helping to set this up.)
6. Finally, we used n-gram matching to eliminate the remaining duplicate texts, although there are undoubtedly still some duplicates in the corpus (see below).
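The first steps above can be sketched in outline. The snippet below is only an illustration: the `make_query` helper and its `region:` suffix are hypothetical stand-ins (the real process used Google's Advanced Search "Region" filter, not a query operator), and the n-gram and URL lists are toy samples. It does show the two core ideas: quoted-phrase queries built from high-frequency n-grams, and each result URL kept only once even if several searches return it.

```python
# Sketch of steps 1-2: turn high-frequency n-grams into quoted-phrase
# queries (hypothetical format), then de-duplicate the result URLs.

NGRAMS = ["and from the", "but it is"]  # sample COCA n-grams

def make_query(ngram, region):
    """Build a quoted-phrase query restricted to one country.
    The 'region:' suffix is an illustrative stand-in, not a real operator."""
    return f'"{ngram}" region:{region}'

def unique_urls(result_lists):
    """Keep each web page only once, even if multiple searches return it."""
    seen, ordered = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered

queries = [make_query(g, "SG") for g in NGRAMS]
pages = unique_urls([["http://a.sg/1", "http://b.sg/2"],
                     ["http://b.sg/2", "http://c.sg/3"]])
```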


One of the problems with a web page-based corpus is the duplicate text that you will find on different pages, and even within the same page. For example, there might be 10-15 pages from the same website that include a copyright notice (e.g. ...you are not permitted to copy this text...). Or there might be a web page with reader comments, in which a comment at the top of the page gets repeated two or three times later on that page.

We have used several methods to remove these duplicates:

1. As we created lists of web pages from Google searches (see above), we only used each web page once, even if it was generated by multiple searches.
2. JusText removed most boilerplate material (e.g. headers, footers, sidebars), which contains a lot of duplicate material on pages from the same website.
3. Once we had downloaded all 1.8 million web pages, we then searched for duplicate n-grams (primarily 11-grams, in our case), looking for long strings of words that are repeated, such as "This newspaper is copyrighted by Company_X. You are not permitted..." ( = 11 words, including punctuation). We ran these searches many times, in many different ways, trying to find and eliminate duplicate texts, and also duplicate strings within different texts.
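The n-gram matching in step 3 can be sketched minimally as word-level "shingling": collect every 11-word sequence from each text and intersect the sets. This is only an illustration of the idea, assuming simple whitespace tokenization (so punctuation attached to a word counts as part of that token); the actual pipeline was run many times, in many different configurations.

```python
def shingles(text, n=11):
    """All n-word sequences (shingles) in a text, tokenized on whitespace."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(text_a, text_b, n=11):
    """n-grams appearing in both texts -- candidate duplicate strings."""
    return shingles(text_a, n) & shingles(text_b, n)

# Two pages sharing a 12-word boilerplate run yield two shared 11-grams.
a = "Local story one. This newspaper is copyrighted by Company X and you are not permitted"
b = "This newspaper is copyrighted by Company X and you are not permitted Different ending here."
overlap = shared_ngrams(a, b)
```

Pages (or passages) whose shingle overlap exceeds some threshold can then be flagged as duplicates for removal.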

Even with these steps, however, there are still duplicate texts and (more commonly) duplicate portions of text in different pages, especially since the corpus is so big (1.9 billion words, in 1.8 million web pages). It will undoubtedly be impossible to eliminate every single one of these duplicates. But at this point, we are continuing to do the following:

4. In the Keyword in Context (KWIC) display (lower right-hand corner of the corpus interface), you will see a number in parentheses (e.g. (1) ) after web pages where there was a duplicate.
5. As these duplicates are found -- one by one as KWIC displays are generated for thousands of corpus users -- they will get logged in the database. Every month or so, we will run scripts to eliminate these duplicate texts / strings. In this way, the corpus will continue to get "cleaner and cleaner" over time.

One final issue: what to do about intra-page duplicates, i.e. cases where the same text is copied within a single web page. As was mentioned above, there might be a web page with reader comments, in which a comment at the top of the page gets repeated two or three times later on that page. Our approach for now is to log these in the database as users generate KWIC displays (#5 above), but not to delete the duplicates. If a comment is copied on a page, it may be because the comment is an important one, and perhaps it deserves to be preserved twice in the corpus. We're still debating this, however.
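The intra-page case can be illustrated with a small sketch that finds repeated blocks on one page but does not remove them, mirroring the log-rather-than-delete policy described above. Splitting the page into blank-line-separated paragraphs is an assumption for the example; any block definition would work the same way.

```python
from collections import Counter

def log_repeated_blocks(page_text):
    """Return blocks (blank-line-separated paragraphs) that occur more
    than once on the same page. Repeats are reported for logging,
    not deleted from the text."""
    blocks = [b.strip() for b in page_text.split("\n\n") if b.strip()]
    counts = Counter(blocks)
    return {block: n for block, n in counts.items() if n > 1}

comment = "Great article, thanks!"
page = "Intro paragraph.\n\n" + comment + "\n\nAnother reply.\n\n" + comment
repeats = log_repeated_blocks(page)
```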

If you have feedback on any of these issues, please feel free to email us. Thanks.