The n-grams contain the frequency of each 1-gram, 2-gram, 3-gram, 4-gram, and 5-gram (one-, two-, three-, four-, and five-word sequences) in a particular dataset (American, British, One Million Books, or Fiction). For example, for the American English dataset (155 billion words of data):
| n-gram | # files | # rows per file (avg) | total # rows | distinct n-grams (types) |
|--------|---------|-----------------------|--------------|--------------------------|
| 1-gram | 10 | 29,135,000 | 2,913,500,000 | 4,193,531 |
| 2-gram | 100 | 39,225,000 | 3,922,500,000 | 62,266,626 |
| 3-gram | 200 | 61,838,000 | 12,367,600,000 | 204,651,081 |
| 4-gram | 400 | 37,775,000 | 15,110,000,000 | 265,717,919 |
| 5-gram | 800 | 37,787,000 | 30,229,600,000 | 192,744,071 |

(Columns 2-4 describe the rows of data in the downloadable Google Books files; the last column gives the number of distinct n-grams, i.e. types.)
The original n-gram files that can be downloaded from Google Books (columns 2-4 above) provide separate entries for each n-gram in each year (e.g. "perfectly happy to" is listed 60 times, showing its frequency in each of 60 different years).
| n-gram | year | frequency | # books | # pages |
|--------|------|-----------|---------|---------|
| quite happy to do | 1976 | 11 | 11 | 11 |
| quite happy to do | 1977 | 2 | 2 | 2 |
| quite happy to do | 1978 | 1 | 1 | 1 |
| quite happy to do | 1979 | 7 | 7 | 7 |
| quite happy to do | 1980 | 7 | 7 | 7 |
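As a rough sketch of how such per-year rows could be read, the Python snippet below parses tab-separated lines with the five columns shown above; the delimiter and column order are assumptions to verify against the actual downloadable Google Books files, not a statement of their documented format.

```python
import csv

def read_ngram_rows(path):
    """Yield (ngram, year, frequency, books, pages) tuples from one raw
    n-gram file, assuming tab-separated columns in the order of the
    table above (an assumption, not the documented file format)."""
    with open(path, encoding="utf-8") as f:
        for ngram, year, freq, books, pages in csv.reader(f, delimiter="\t"):
            yield ngram, int(year), int(freq), int(books), int(pages)
```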
We then processed these
files to give just one row of data for each distinct n-gram, with
the total frequency in each decade. For example, the following is
the one row for quite happy to do:
| n-gram | total | 1810s | 1820s | 1830s | 1840s | 1850s | 1860s | 1870s | 1880s | 1890s | 1900s | 1910s | 1920s | 1930s | 1940s | 1950s | 1960s | 1970s | 1980s | 1990s | 2000s |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| quite happy to do | 281 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 6 | 1 | 3 | 12 | 30 | 32 | 43 | 62 | 86 |
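A minimal sketch of that aggregation step, assuming per-year tuples like those produced by the reader above: it sums each n-gram's yearly frequencies into one record with a grand total and one value per decade (1810s through 2000s). The function and record layout are illustrative, not the production code.

```python
from collections import defaultdict

def decade_totals(rows, first=1810, last=2009):
    """Collapse per-year rows (ngram, year, freq, books, pages) into one
    record per distinct n-gram: a grand total plus a frequency for each
    decade from the 1810s through the 2000s."""
    decades = range(first, last + 1, 10)
    counts = defaultdict(lambda: dict.fromkeys(decades, 0))
    for ngram, year, freq, _books, _pages in rows:
        if first <= year <= last:
            counts[ngram][year - year % 10] += freq
    return {ngram: {"total": sum(per_decade.values()), **per_decade}
            for ngram, per_decade in counts.items()}
```

Run over every downloaded file for a given n-gram length, output in this one-row-per-n-gram shape can then be loaded into tables like the one above.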
It is tables like this, stored in relational databases, that users search when they run a query.
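For illustration only, the sketch below runs that kind of lookup against a local SQLite copy; the database file, table name, and column names are hypothetical stand-ins, since the production schema and engine are not described here.

```python
import sqlite3

# Hypothetical schema: one row per n-gram with a total column and one
# column per decade, mirroring the layout of the table above.
conn = sqlite3.connect("ngrams.db")  # assumed local database file
row = conn.execute(
    "SELECT total, d1980s, d1990s, d2000s FROM ngram_4 WHERE ngram = ?",
    ("quite happy to do",),
).fetchone()
print(row)  # would print (281, 43, 62, 86) given the example row above
conn.close()
```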