GLOWBE CORPUS
CORPUS DESIGN AND OVERVIEW (number of web sites (distinct domains), web pages, and words).
("General" searches may also include blogs; the "Blogs" columns count blog-only pages.)

Country | Code | General: Sites | General: Pages | General: Words | Blogs: Sites | Blogs: Pages | Blogs: Words | Total: Sites | Total: Pages | Total: Words
United States | US | 43,249 | 168,771 | 253,536,242 | 48,116 | 106,385 | 133,061,093 | 82,260 | 275,156 | 386,809,355
Canada | CA | 22,178 | 81,644 | 90,846,732 | 16,745 | 54,048 | 43,814,827 | 33,776 | 135,692 | 134,765,381
Great Britain | GB | 39,254 | 232,428 | 255,672,390 | 35,229 | 149,413 | 131,671,002 | 64,351 | 381,841 | 387,615,074
Ireland | IE | 12,978 | 75,432 | 80,530,794 | 5,512 | 26,715 | 20,410,027 | 15,840 | 102,147 | 101,029,231
Australia | AU | 19,619 | 81,683 | 104,716,366 | 13,516 | 47,561 | 43,390,501 | 28,881 | 129,244 | 148,208,169
New Zealand | NZ | 11,202 | 54,862 | 58,698,828 | 4,970 | 27,817 | 22,625,584 | 14,053 | 82,679 | 81,390,476
India | IN | 11,217 | 76,609 | 68,032,551 | 9,289 | 37,156 | 28,310,511 | 18,618 | 113,765 | 96,430,888
Sri Lanka | LK | 3,307 | 25,310 | 33,793,772 | 1,672 | 13,079 | 12,760,726 | 4,208 | 38,389 | 46,583,115
Pakistan | PK | 3,070 | 25,852 | 38,005,985 | 2,899 | 16,917 | 13,332,245 | 4,955 | 42,769 | 51,367,152
Bangladesh | BD | 4,415 | 30,813 | 28,700,158 | 2,332 | 14,246 | 10,922,869 | 5,712 | 45,059 | 39,658,255
Singapore | SG | 5,775 | 28,332 | 29,229,186 | 4,255 | 17,127 | 13,711,412 | 8,339 | 45,459 | 42,974,705
Malaysia | MY | 6,225 | 29,302 | 29,026,896 | 4,591 | 16,299 | 13,357,745 | 8,966 | 45,601 | 42,420,168
Philippines | PH | 6,169 | 28,391 | 29,758,446 | 5,979 | 17,951 | 13,457,087 | 10,224 | 46,342 | 43,250,093
Hong Kong | HK | 6,720 | 27,896 | 27,906,879 | 2,892 | 16,040 | 12,508,796 | 8,740 | 43,936 | 40,450,291
South Africa | ZA | 7,318 | 28,271 | 31,683,286 | 4,566 | 16,993 | 13,645,623 | 10,308 | 45,264 | 45,364,498
Nigeria | NG | 3,448 | 23,329 | 30,622,738 | 2,072 | 13,956 | 11,996,583 | 4,516 | 37,285 | 42,646,098
Ghana | GH | 3,161 | 32,189 | 27,644,721 | 1,053 | 15,162 | 11,088,160 | 3,616 | 47,351 | 38,768,231
Kenya | KE | 4,222 | 31,166 | 28,552,920 | 2,073 | 14,796 | 12,480,777 | 5,193 | 45,962 | 41,069,085
Tanzania | TZ | 3,829 | 27,533 | 24,883,840 | 1,414 | 13,823 | 10,253,840 | 4,575 | 41,356 | 35,169,042
Jamaica | JM | 3,049 | 30,928 | 28,505,416 | 1,049 | 15,820 | 11,124,273 | 3,488 | 46,748 | 39,663,666
TOTAL | | 220,405 | 1,140,741 | 1,300,348,146 | 170,224 | 651,304 | 583,923,681 | 340,619 | 1,792,045 | 1,885,632,973
CREATION OF THE CORPUS
The web pages were collected in December 2012, using the following process:
1. The list of web pages was created by running hundreds of high-frequency COCA n-grams (e.g. "and from the", "but it is") against Google to generate essentially "random" web pages (presumably there would be no AdSense entries or meaningful page rankings for a phrase like "and from the").
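To make this concrete, here is a hedged Python sketch of what such query generation might look like. The actual GloWbE pipeline is not published; the URL parameters ("q", "cr", "num") are assumptions based on Google's public search-URL syntax at the time, and any automated querying of Google is subject to its terms of service.

```python
# Hypothetical sketch of step 1: turning high-frequency n-grams into
# region-restricted Google search URLs. Parameter names are assumptions.
from urllib.parse import urlencode

NGRAMS = ['"and from the"', '"but it is"']   # quoted for exact-phrase matching
REGIONS = {"US": "countryUS", "NZ": "countryNZ", "LK": "countryLK", "HK": "countryHK"}

def search_urls(ngrams, country_code):
    """Yield one region-restricted search URL per n-gram."""
    for ngram in ngrams:
        params = {
            "q": ngram,                      # the phrase itself
            "cr": REGIONS[country_code],     # "country restrict" (the Advanced Search [Region] box)
            "num": 100,                      # ask for up to 100 results per query
        }
        yield "https://www.google.com/search?" + urlencode(params)

for url in search_urls(NGRAMS, "NZ"):
    print(url)
```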
2. We repeated this process for each of the 20 different countries (e.g. United States, New Zealand, Sri Lanka, Hong Kong), limiting the results to each country with the [Region] function of Google's "Advanced Search". The question, of course, is how well Google knows which country a page comes from if it isn't marked by a country-specific top-level domain (e.g. .za for South Africa).
As Google explains, "we'll rely largely on the site's (1) country domain (.ca, .de, etc.). If an international domain (.com, .org, .eu, etc.) has been used, we'll rely on several signals, including (2) IP address, (3) location information on the page, (4) links to the page, and (5) any relevant information from Google Places."
For example, for a .com address (where no country-specific domain is listed), Google will first try to use the IP address, which shows where the server is physically located. But even if that fails, Google can still see that, say, 95% of the visitors to the site come from Singapore and that 95% of the links to the page are from Singapore (and Google knows both of these things), and it would then guess that the site is probably from Singapore. It isn't perfect, but it is very, very good, as the results from the dialect-oriented searches show.
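To illustrate the reasoning (this is emphatically not Google's actual algorithm), here is a toy Python sketch in which a country-code domain settles the question outright and the other signals otherwise vote; every name and threshold in it is hypothetical.

```python
# Toy illustration of combining weak signals to guess a site's country.
from collections import Counter

def guess_country(tld, ip_country, visitor_countries, linking_countries):
    """Return the best-guess country for a site."""
    if tld in {"ca", "de", "sg", "za"}:      # a country-code TLD settles it
        return tld.upper()
    votes = Counter()
    if ip_country:
        votes[ip_country] += 1               # where the server sits
    votes.update(visitor_countries)          # e.g. 95 of 100 visits from SG
    votes.update(linking_countries)          # e.g. 95 of 100 inlinks from SG
    return votes.most_common(1)[0][0] if votes else None

# A .com site hosted in the US whose traffic and inlinks are overwhelmingly Singaporean:
print(guess_country("com", "US", ["SG"] * 95 + ["US"] * 5, ["SG"] * 95))  # -> SG
```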
3. In addition to doing 20 different sets of searches (one for each of the 20 countries) with "General" Google searches (all web pages), we repeated the whole process with Google "Blog" searches (using the Advanced / Region options in both cases). The blog searches return only blogs, while the "General" searches also include some blogs.
4. We then downloaded all of the two million unique web pages using HTTrack.
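HTTrack itself is a standalone crawler, so no project code is shown here; the following minimal Python sketch just illustrates the same idea of fetching each unique URL once and saving the raw HTML. The file naming and error handling are invented for the example.

```python
# Minimal stand-in for the bulk-download step (HTTrack did the real work).
import hashlib
import pathlib
import requests

def download(urls, out_dir="pages"):
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for url in urls:
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException:
            continue                                  # skip dead or slow pages
        # Name each file by a hash of its URL so re-runs overwrite, not duplicate.
        name = hashlib.sha1(url.encode()).hexdigest() + ".html"
        (out / name).write_text(resp.text, encoding="utf-8", errors="replace")
```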
5. After this, we ran all of the two million web pages through JusText to remove boilerplate material (e.g. headers, footers, sidebars). (Thanks to Michael Bean for helping to set this up.)
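For readers who want to try this step themselves, the jusText Python package exposes it directly; the snippet below follows the package's documented usage, with a placeholder URL.

```python
# Strip boilerplate from one downloaded page with jusText.
import requests
import justext

response = requests.get("https://example.com/page.html")   # placeholder URL
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
# Keep only the paragraphs jusText classifies as real content.
main_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
print(main_text)
```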
6. Finally, we used n-gram matching to eliminate the remaining duplicate texts, although there are undoubtedly still some duplicates in the corpus (see below).
REMAINING ISSUES: DUPLICATE TEXTS
One of the problems with a corpus based on web pages is the duplicate text that appears on different pages, and even within the same page. For example, there might be 10-15 pages from the same website that include a copyright notice (e.g. "...you are not permitted to copy this text..."). Or there might be a web page with reader comments, in which a comment at the top of the page is repeated two or three times later on that page.
We have used several methods to remove these duplicates:
1. As we created the lists of web pages from the Google searches (see above), we used each web page only once, even if it was returned by multiple searches.
2. JusText removed most boilerplate material (e.g. headers, footers, sidebars), which accounts for a great deal of the duplicate material on pages from the same website.
3. Once we had downloaded all 1.8 million web pages, we searched for duplicate n-grams (primarily 11-grams, in our case), looking for long strings of words that are repeated, such as "This newspaper is copyrighted by Company_X. You are not permitted..." (= 11 tokens, including punctuation). We ran these searches many times, in many different ways, to find and eliminate duplicate texts, as well as duplicate strings within different texts; a simplified sketch of this matching follows below.
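Here is a simplified Python sketch of this kind of 11-gram matching: hash every run of 11 tokens and flag a page that shares many such runs with pages already seen. The tokenization and the 50% overlap threshold are illustrative assumptions, not the actual values used for GloWbE.

```python
# Simplified 11-gram near-duplicate detection.
import re

N = 11
seen_shingles = set()

def shingles(text, n=N):
    """All runs of n tokens (words and punctuation marks) in the text."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_near_duplicate(text, threshold=0.5):
    """True if over half of this page's 11-grams were already seen elsewhere."""
    grams = shingles(text)
    if not grams:
        return False
    overlap = sum(1 for g in grams if g in seen_shingles)
    seen_shingles.update(grams)
    return overlap / len(grams) > threshold
```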
Even with these steps, however, there are still duplicate texts and (more commonly) duplicate portions of text on different pages, especially since the corpus is so big (1.9 billion words, in 1.8 million web pages). It will undoubtedly be impossible to eliminate every single one of these duplicates. But at this point, we are continuing to do the following:
4. In the Keyword in Context (KWIC) display (lower right-hand corner of the corpus interface), you will see a number in parentheses (e.g. "(1)") after web pages where a duplicate was found.
5. As these duplicates are found -- one by one, as KWIC displays are generated for thousands of corpus users -- they are logged in the database. Every month or so, we run scripts to eliminate these duplicate texts and strings. In this way, the corpus will continue to get "cleaner and cleaner" over time.
One final issue is what to do about intra-page duplicates, i.e. cases where the same text is copied within a single web page. As mentioned above, there might be a web page with reader comments in which a comment at the top of the page is repeated two or three times later on that page. Our approach at this point is to log these in the database as users generate KWIC displays (#5 above), but not to delete the duplicates for now. If a comment is copied on a page, it may be because the comment is an important one, and perhaps it deserves to appear twice in the corpus. We're still debating this, however.
If you have feedback on any of these issues, please feel free to email us.
Thanks.