The Blog and General web page texts are a subset of the texts from the US portion of the GloWbE corpus. The web pages were collected in December 2012, using the following process:
1. The list of web pages was created by running hundreds of high-frequency COCA n-grams (e.g. "and from the", "but it is") against Google to generate essentially "random" web pages (presumably there would be no AdSense entries or meaningful page rankings for phrases like "and from the"). A query sketch appears after this list.
2. We repeated this process for each of the 20 different countries (e.g. United States, New Zealand, Sri Lanka, Hong Kong) in the GloWbE corpus, limiting results to each country with the [Region] function of Google "Advanced Search". The question, of course, is how well Google knows which country a page comes from if it isn't marked by a country-specific top-level domain (e.g. .us for the United States). As Google explains, "we'll rely largely on the site's (1) country domain (.ca, .de, etc.). If an international domain (.com, .org, .eu, etc.) has been used, we'll rely on several signals, including (2) IP address, (3) location information on the page, (4) links to the page, and (5) any relevant information from Google Places." For example, for a .com address (where no country domain is listed), Google will try to use the IP address (which shows where the computer hosting the site is physically located). But even if that fails, Google could still see that 95% of the visitors to the site come from the US and that 95% of the links to that page are from the US (and remember that Google knows both of these things), and it would then guess that the site is probably from the US. It isn't perfect, but it's very, very good (see the EWW 2013 article for more information).
3. For the Blog and General Web portions of COCA, we simply took data from the US portion of GloWbE.
4. When GloWbE was created in late 2012, it was possible to do "General" Google searches (all web pages) as well as "Blog" searches on Google (the latter have since been deprecated). The blog searches returned only blogs, while the "General" searches also included some blogs.
5. We then downloaded all of the two million unique web pages using HTTrack (see the download sketch after this list).
6. After this, we ran all of the two million web pages through JusText to remove boilerplate material (e.g. headers, footers, sidebars); a short jusText example also appears after this list.
7. We used n-gram matching to eliminate the remaining duplicate texts (a deduplication sketch follows the list). Even more difficult was the removal of duplicate "snippets" of text that appear on multiple web pages (e.g. legal notices, or information about the creator of a blog or about a newspaper columnist), which JusText didn't eliminate.
8. In Summer 2019, Serge Sharoff categorized these texts from GloWbE-US into web genres. We took the subset of the GloWbE-US texts with the most accurate categorization (according to Sharoff's algorithms), and those are the texts that are found in COCA.
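
As an illustration of step 1, the sketch below loops over a few high-frequency n-grams and pools the result URLs from each query. This is only a sketch: the run_search callable, the seed phrases listed, and the exact-match quoting are assumptions standing in for whatever search setup was actually used.

```python
from typing import Callable, Iterable, Set

# A few high-frequency n-grams of the kind described in step 1 (illustrative only).
SEED_NGRAMS = ["and from the", "but it is", "on the other", "as well as the"]

def collect_seed_urls(run_search: Callable[[str], Iterable[str]],
                      ngrams: Iterable[str] = SEED_NGRAMS) -> Set[str]:
    """Query a search engine with each n-gram and pool the resulting URLs.

    `run_search` is a hypothetical callable that takes a query string and
    yields result URLs; any search backend could be plugged in here.
    """
    urls: Set[str] = set()
    for ngram in ngrams:
        for url in run_search(f'"{ngram}"'):   # quote the phrase for an exact match
            urls.add(url)
    return urls
```

Because the seed phrases are common function-word sequences, the pooled URLs approximate a random sample of web pages, which is the point of the procedure described in step 1.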
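For the bulk download in step 5, one minimal way to drive HTTrack from Python is to call the command-line tool once per URL via subprocess. Only the basic -O output option is shown; any depth, rate, or politeness settings the original project used are not documented here, so this is a sketch rather than the actual command.

```python
import subprocess
from pathlib import Path

def mirror_pages(urls, out_root="downloads"):
    """Download each URL with the HTTrack command-line tool.

    Assumes `httrack` is installed and on PATH; only the basic output
    option (-O) is used here, so real runs would add further options.
    """
    for i, url in enumerate(urls):
        out_dir = Path(out_root) / f"site_{i:07d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(["httrack", url, "-O", str(out_dir)], check=False)
```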
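The boilerplate removal in step 6 can be reproduced with the jusText Python package, which classifies each paragraph of a page as boilerplate or body text. Reading a single local file and joining the surviving paragraphs with blank lines are choices made here for illustration only.

```python
import justext

def extract_main_text(html_bytes: bytes) -> str:
    """Strip boilerplate (headers, footers, sidebars) and keep body paragraphs."""
    paragraphs = justext.justext(html_bytes, justext.get_stoplist("English"))
    return "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)

if __name__ == "__main__":
    with open("page.html", "rb") as fh:   # hypothetical downloaded page
        print(extract_main_text(fh.read()))
```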
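The deduplication in step 7 is described only as "n-gram matching", so the sketch below assumes a common variant: compare the word 5-gram ("shingle") sets of two texts and flag pairs whose Jaccard overlap exceeds a threshold, and separately count how many documents each n-gram appears in so that repeated snippets (legal notices, author blurbs) can be spotted. The n-gram size and thresholds are illustrative, not the values used for COCA.

```python
from collections import Counter

def ngrams(text: str, n: int = 5) -> set:
    """Set of word n-grams ('shingles') for a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicates(a: str, b: str, n: int = 5, threshold: float = 0.8) -> bool:
    """Flag two texts as duplicates if their n-gram sets overlap heavily (Jaccard)."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= threshold

def frequent_snippets(texts, n: int = 5, min_docs: int = 50) -> set:
    """N-grams that recur across many documents (e.g. boilerplate legal notices)."""
    doc_counts = Counter()
    for text in texts:
        doc_counts.update(ngrams(text, n))
    return {g for g, c in doc_counts.items() if c >= min_docs}
```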