GLOWBE CORPUS
CORPUS DESIGN AND OVERVIEW (number of web sites (distinct domains), web pages, and words).
("General" searches may also include blogs; the "Blogs" columns count blog-only pages.)

Country | Code | General: Sites | General: Pages | General: Words | Blogs: Sites | Blogs: Pages | Blogs: Words | Total: Sites | Total: Pages | Total: Words
United States | US | 43,249 | 168,771 | 253,536,242 | 48,116 | 106,385 | 133,061,093 | 82,260 | 275,156 | 386,809,355
Canada | CA | 22,178 | 81,644 | 90,846,732 | 16,745 | 54,048 | 43,814,827 | 33,776 | 135,692 | 134,765,381
Great Britain | GB | 39,254 | 232,428 | 255,672,390 | 35,229 | 149,413 | 131,671,002 | 64,351 | 381,841 | 387,615,074
Ireland | IE | 12,978 | 75,432 | 80,530,794 | 5,512 | 26,715 | 20,410,027 | 15,840 | 102,147 | 101,029,231
Australia | AU | 19,619 | 81,683 | 104,716,366 | 13,516 | 47,561 | 43,390,501 | 28,881 | 129,244 | 148,208,169
New Zealand | NZ | 11,202 | 54,862 | 58,698,828 | 4,970 | 27,817 | 22,625,584 | 14,053 | 82,679 | 81,390,476
India | IN | 11,217 | 76,609 | 68,032,551 | 9,289 | 37,156 | 28,310,511 | 18,618 | 113,765 | 96,430,888
Sri Lanka | LK | 3,307 | 25,310 | 33,793,772 | 1,672 | 13,079 | 12,760,726 | 4,208 | 38,389 | 46,583,115
Pakistan | PK | 3,070 | 25,852 | 38,005,985 | 2,899 | 16,917 | 13,332,245 | 4,955 | 42,769 | 51,367,152
Bangladesh | BD | 4,415 | 30,813 | 28,700,158 | 2,332 | 14,246 | 10,922,869 | 5,712 | 45,059 | 39,658,255
Singapore | SG | 5,775 | 28,332 | 29,229,186 | 4,255 | 17,127 | 13,711,412 | 8,339 | 45,459 | 42,974,705
Malaysia | MY | 6,225 | 29,302 | 29,026,896 | 4,591 | 16,299 | 13,357,745 | 8,966 | 45,601 | 42,420,168
Philippines | PH | 6,169 | 28,391 | 29,758,446 | 5,979 | 17,951 | 13,457,087 | 10,224 | 46,342 | 43,250,093
Hong Kong | HK | 6,720 | 27,896 | 27,906,879 | 2,892 | 16,040 | 12,508,796 | 8,740 | 43,936 | 40,450,291
South Africa | ZA | 7,318 | 28,271 | 31,683,286 | 4,566 | 16,993 | 13,645,623 | 10,308 | 45,264 | 45,364,498
Nigeria | NG | 3,448 | 23,329 | 30,622,738 | 2,072 | 13,956 | 11,996,583 | 4,516 | 37,285 | 42,646,098
Ghana | GH | 3,161 | 32,189 | 27,644,721 | 1,053 | 15,162 | 11,088,160 | 3,616 | 47,351 | 38,768,231
Kenya | KE | 4,222 | 31,166 | 28,552,920 | 2,073 | 14,796 | 12,480,777 | 5,193 | 45,962 | 41,069,085
Tanzania | TZ | 3,829 | 27,533 | 24,883,840 | 1,414 | 13,823 | 10,253,840 | 4,575 | 41,356 | 35,169,042
Jamaica | JM | 3,049 | 30,928 | 28,505,416 | 1,049 | 15,820 | 11,124,273 | 3,488 | 46,748 | 39,663,666
TOTAL | | 220,405 | 1,140,741 | 1,300,348,146 | 170,224 | 651,304 | 583,923,681 | 340,619 | 1,792,045 | 1,885,632,973
CREATION OF THE CORPUS
The web pages were collected in December 2012, using the following process:
1. The list of web pages was created by running hundreds of high-frequency COCA n-grams (e.g. "and from the", "but it is") against Google to generate essentially "random" web pages (presumably there would be no AdSense entries or meaningful page rankings for a phrase like "and from the").
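To make this concrete, here is a hedged Python sketch of what such query generation might look like. The actual GloWbE pipeline is not published; the URL parameters ("q", "cr", "num") are assumptions based on Google's public search-URL syntax at the time, and any automated querying of Google is subject to its terms of service.

```python
# Hypothetical sketch of step 1: turning high-frequency n-grams into
# region-restricted Google search URLs. Parameter names are assumptions.
from urllib.parse import urlencode

NGRAMS = ['"and from the"', '"but it is"']   # quoted for exact-phrase matching
REGIONS = {"US": "countryUS", "NZ": "countryNZ", "LK": "countryLK", "HK": "countryHK"}

def search_urls(ngrams, country_code):
    """Yield one region-restricted search URL per n-gram."""
    for ngram in ngrams:
        params = {
            "q": ngram,                      # the phrase itself
            "cr": REGIONS[country_code],     # "country restrict" (the Advanced Search [Region] box)
            "num": 100,                      # ask for up to 100 results per query
        }
        yield "https://www.google.com/search?" + urlencode(params)

for url in search_urls(NGRAMS, "NZ"):
    print(url)
```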
2. We repeated this process for each of the 20 different countries (e.g. United States, New Zealand, Sri Lanka, Hong Kong), limiting the results to each country with the [Region] function of Google's "Advanced Search". The question, of course, is how well Google knows which country a page comes from if it isn't marked by a country-specific top-level domain (e.g. .za for South Africa).
As Google explains, "we'll rely largely on the site's (1) country domain (.ca, .de, etc.). If an international domain (.com, .org, .eu, etc.) has been used, we'll rely on several signals, including (2) IP address, (3) location information on the page, (4) links to the page, and (5) any relevant information from Google Places."
For example, for a .com address (where no country-specific domain is listed), Google will first try to use the IP address, which shows where the server is physically located. But even if that fails, Google can still see that, say, 95% of the visitors to the site come from Singapore and that 95% of the links to the page are from Singapore (and Google knows both of these things), and it would then guess that the site is probably from Singapore. It isn't perfect, but it is very, very good, as the results from the dialect-oriented searches show.
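To illustrate the reasoning (this is emphatically not Google's actual algorithm), here is a toy Python sketch in which a country-code domain settles the question outright and the other signals otherwise vote; every name and threshold in it is hypothetical.

```python
# Toy illustration of combining weak signals to guess a site's country.
from collections import Counter

def guess_country(tld, ip_country, visitor_countries, linking_countries):
    """Return the best-guess country for a site."""
    if tld in {"ca", "de", "sg", "za"}:      # a country-code TLD settles it
        return tld.upper()
    votes = Counter()
    if ip_country:
        votes[ip_country] += 1               # where the server sits
    votes.update(visitor_countries)          # e.g. 95 of 100 visits from SG
    votes.update(linking_countries)          # e.g. 95 of 100 inlinks from SG
    return votes.most_common(1)[0][0] if votes else None

# A .com site hosted in the US whose traffic and inlinks are overwhelmingly Singaporean:
print(guess_country("com", "US", ["SG"] * 95 + ["US"] * 5, ["SG"] * 95))  # -> SG
```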
3. In addition to doing 20 different sets of searches (one for each of the 20 countries) with "General" Google searches (all web pages), we repeated the whole process with Google "Blog" searches (using the Advanced / Region options in both cases). The blog searches return only blogs, while the "General" searches also include some blogs.
4. We then downloaded all of the two million unique web pages using HTTrack.
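HTTrack itself is a standalone crawler, so no project code is shown here; the following minimal Python sketch just illustrates the same idea of fetching each unique URL once and saving the raw HTML. The file naming and error handling are invented for the example.

```python
# Minimal stand-in for the bulk-download step (HTTrack did the real work).
import hashlib
import pathlib
import requests

def download(urls, out_dir="pages"):
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for url in urls:
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException:
            continue                                  # skip dead or slow pages
        # Name each file by a hash of its URL so re-runs overwrite, not duplicate.
        name = hashlib.sha1(url.encode()).hexdigest() + ".html"
        (out / name).write_text(resp.text, encoding="utf-8", errors="replace")
```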
5. After this, we ran all of the two million web pages through JusText to remove boilerplate material (e.g. headers, footers, sidebars). (Thanks to Michael Bean for helping to set this up.)
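For readers who want to try this step themselves, the jusText Python package exposes it directly; the snippet below follows the package's documented usage, with a placeholder URL.

```python
# Strip boilerplate from one downloaded page with jusText.
import requests
import justext

response = requests.get("https://example.com/page.html")   # placeholder URL
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
# Keep only the paragraphs jusText classifies as real content.
main_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
print(main_text)
```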
6. Finally, we used n-gram matching to eliminate the remaining duplicate texts, although there are undoubtedly still some duplicates in the corpus (see below).
REMAINING ISSUES: DUPLICATE TEXTS
One of the problems with a corpus based on web pages is the duplicate text that appears on different pages, and even within the same page. For example, there might be 10-15 pages from the same website that include a copyright notice (e.g. "...you are not permitted to copy this text..."). Or there might be a web page with reader comments, in which a comment at the top of the page is repeated two or three times later on that page.
We have used several methods to remove these duplicates:
1. As we created the lists of web pages from the Google searches (see above), we used each web page only once, even if it was returned by multiple searches.
2. JusText removed most boilerplate material (e.g. headers, footers, sidebars), which accounts for a great deal of the duplicate material on pages from the same website.
3. Once we had downloaded all 1.8 million web pages, we searched for duplicate n-grams (primarily 11-grams, in our case), looking for long strings of words that are repeated, such as "This newspaper is copyrighted by Company_X. You are not permitted..." (= 11 tokens, including punctuation). We ran these searches many times, in many different ways, to find and eliminate duplicate texts, as well as duplicate strings within different texts; a simplified sketch of this matching follows below.
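Here is a simplified Python sketch of this kind of 11-gram matching: hash every run of 11 tokens and flag a page that shares many such runs with pages already seen. The tokenization and the 50% overlap threshold are illustrative assumptions, not the actual values used for GloWbE.

```python
# Simplified 11-gram near-duplicate detection.
import re

N = 11
seen_shingles = set()

def shingles(text, n=N):
    """All runs of n tokens (words and punctuation marks) in the text."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_near_duplicate(text, threshold=0.5):
    """True if over half of this page's 11-grams were already seen elsewhere."""
    grams = shingles(text)
    if not grams:
        return False
    overlap = sum(1 for g in grams if g in seen_shingles)
    seen_shingles.update(grams)
    return overlap / len(grams) > threshold
```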
Even with these steps, however, there are still duplicate texts and (more commonly) duplicate portions of text on different pages, especially since the corpus is so big (1.9 billion words, in 1.8 million web pages). It will undoubtedly be impossible to eliminate every single one of these duplicates. But at this point, we are continuing to do the following:
4. In the Keyword in Context (KWIC) display (lower right-hand corner of the corpus interface), you will see a number in parentheses (e.g. "(1)") after web pages where a duplicate was found.
5. As these duplicates are found -- one by one, as KWIC displays are generated for thousands of corpus users -- they are logged in the database. Every month or so, we run scripts to eliminate these duplicate texts and strings. In this way, the corpus will continue to get "cleaner and cleaner" over time.
One final issue is what to do about intra-page duplicates, i.e. cases where the same text is copied within a single web page. As mentioned above, there might be a web page with reader comments in which a comment at the top of the page is repeated two or three times later on that page. Our approach at this point is to log these in the database as users generate KWIC displays (#5 above), but not to delete the duplicates for now. If a comment is copied on a page, it may be because the comment is an important one, and perhaps it deserves to appear twice in the corpus. We're still debating this, however.
If you have feedback on any of these issues, please feel free to email us.
Thanks.