SPEED
For very large corpora, Sketch Engine is just about the fastest corpus
architecture available. Our architecture, however, is even faster: about 10-15
times as fast, on average, for "string searches" like those shown below. This
means that with a large corpus like iWeb, for example, you might spend 5 minutes
doing a series of searches, whereas the same searches would take a little more
than an hour (i.e. 60 minutes of just sitting there, waiting for results) in a
similar-sized corpus in Sketch Engine.
The following data is based on the 14.0 billion word iWeb corpus and the 13.2
billion word enTenTen15 corpus from Sketch Engine (since these two corpora are
roughly the same size).
The first two columns show the search strings in iWeb and enTenTen15. The last
four columns show the search times (in seconds) in iWeb and Sketch Engine (SE),
followed by two speed ratios. The (Prelim) column provides a preliminary
estimate of how much faster the search is in iWeb (e.g. 6.8 times as fast for
ADJ plans). Because iWeb is a bit larger than enTenTen15 (and so any search
should take a little longer in iWeb than in enTenTen15), the [x faster] column
(rightmost column) takes this difference into account.
As you can see, the English-Corpora.org architecture is about 10-15 times as
fast as the Sketch Engine interface (see note 3). As far as we are aware, this
is the fastest architecture available for full-featured, structured corpora.
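As a worked example of how the two ratio columns relate to the timings, take the ADJ plans row (3.7 seconds in iWeb, 25 seconds in Sketch Engine). The notation below is introduced here for illustration only, and the page does not state the exact size correction it uses; a simple linear scaling by corpus size is assumed.

```latex
% t_SE, t_iWeb: search times in seconds (from the table)
% N_iWeb, N_SE: corpus sizes in billions of words (14.0 and 13.2)
% The linear size correction is an assumption, not the page's stated method.
\[
\text{prelim} = \frac{t_{\mathrm{SE}}}{t_{\mathrm{iWeb}}} = \frac{25}{3.7} \approx 6.8
\]
\[
\text{x faster} \approx \frac{t_{\mathrm{SE}}}{t_{\mathrm{iWeb}}} \times \frac{N_{\mathrm{iWeb}}}{N_{\mathrm{SE}}}
= \frac{25}{3.7} \times \frac{14.0}{13.2} \approx 7.2
\]
```

The table reports 7.1 for this row, so the correction actually applied appears to be slightly smaller than the plain size ratio.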
| iWeb (14 billion words) (note 1) | Sketch Engine: enTenTen15 (13.2 billion words) (note 2) | iWeb (sec) | SE (sec) | (Prelim) | x faster |
| ADJ plans | [tag = "J.*"] [word = "plans"] | 3.7 | 25 | 6.8 | 7.1 |
| long NOUN * | [word = "long"] [tag = "N.*"] | 7.6 | 103 | 13.6 | 14.3 |
| I VERB whether | [word = "I"] [tag = "V.*"] [word = "whether"] | 3.2 | 54 | 16.9 | 17.8 |
| never really VERB+ * | [word = "never"] [word = "really"] [tag = "V.*"] | 5.6 | 47 | 8.4 | 8.8 |
| the best NOUN * | [word="the"] [word="best"] [tag="N.*"] | 9.2 | 171 | 18.6 | 19.6 |
| ADV ADJ places | [tag = "R.*"] [tag = "J.*"] [word = "places"] | 3.7 | 34 | 9.2 | 9.7 |
| VERB them make * | [tag = "V.*"] [word = "them"] [word = "make"] | 10.3 | 39 | 3.8 | 4.0 |
| NOUN PRON BUY | [tag="N.*"] [tag="PP.*"] [lemma="buy"] | 4.8 | 86 | 17.9 | 18.9 |
| THINK PRON VERB+ * | [lemma="think"] [tag="PP.*"] [tag="V.*"] | 7.3 | 67 | 9.2 | 9.7 |
| DO NEG it seem | [lemma="do"] [word="n't"] [word="seem"] [word="to"] [tag="V.*"] | 1.9 | 44 | 23.2 | 24.4 |
| VERB her way PREP * | [tag="V.*"] [word="her"] [word="way"] [tag="IN.*"] | 10.3 | 80 | 7.8 | 8.2 |
| VERB through the NOUN * | [tag="V.*"] [word="through"] [word="the"] [tag="N.*"] | 10 | 210 | 21.0 | 22.1 |
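The "about 10-15 times" figure can be checked against the rows above. The following is a minimal Python sketch that recomputes the preliminary ratio from the two timing columns and averages the published [x faster] column. The names `ROWS`, `prelim_ratio`, and `summarize` are illustrative only and are not part of either corpus interface.

```python
from statistics import mean, median

# Rows copied from the table above:
# (iWeb search, iWeb seconds, Sketch Engine seconds, published "x faster")
ROWS = [
    ("ADJ plans", 3.7, 25, 7.1),
    ("long NOUN", 7.6, 103, 14.3),
    ("I VERB whether", 3.2, 54, 17.8),
    ("never really VERB+", 5.6, 47, 8.8),
    ("the best NOUN", 9.2, 171, 19.6),
    ("ADV ADJ places", 3.7, 34, 9.7),
    ("VERB them make", 10.3, 39, 4.0),
    ("NOUN PRON BUY", 4.8, 86, 18.9),
    ("THINK PRON VERB+", 7.3, 67, 9.7),
    ("DO NEG it seem", 1.9, 44, 24.4),
    ("VERB her way PREP", 10.3, 80, 8.2),
    ("VERB through the NOUN", 10, 210, 22.1),
]


def prelim_ratio(iweb_seconds, se_seconds):
    """Return how many times faster iWeb was, before any size correction."""
    return se_seconds / iweb_seconds


def summarize(rows):
    """Return summary statistics for the recomputed and published ratios."""
    prelim = [prelim_ratio(iweb, se) for _, iweb, se, _ in rows]
    published = [x_faster for _, _, _, x_faster in rows]
    return {
        "mean_prelim": mean(prelim),
        "mean_x_faster": mean(published),
        "median_x_faster": median(published),
    }


if __name__ == "__main__":
    for name, value in summarize(ROWS).items():
        print(f"{name}: {value:.1f}")
```

For these twelve rows, the mean of the published [x faster] column is about 13.7, which falls inside the stated 10-15 range, although individual searches vary from 4.0 to 24.4.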
Notes:
1. Click on the link to do the
search in iWeb. If there is an asterisk after the search, the first
results will be from pre-calculated "n-grams" tables, which should be much
faster than the times shown here. But it wouldn't really be a fair comparison to
Sketch Engine, since Sketch Engine doesn't have pre-calculated n-grams tables.
Therefore, to have a "fair" comparison and to search "from scratch", click on
"Use Large N-grams" and then "See Full List" in the iWeb results.
2. To search in Sketch Engine, select Concordance / Advanced / CQL and then
insert the CQL string. Once it starts showing the KWIC results, click on the
[Frequency] icon in the row of icons at the top. The time shown in the SE
(Sketch Engine) column above is the combined time from when it starts
displaying KWIC results and when it finally produces the frequency list of
matching strings, with about 2-3 seconds subtracted to allow for the click on
the [Frequency] icon.
3. Some people might wonder why we haven't compared our results to CQPWeb as
well. This is because CQPWeb is limited to corpora of 2 billion words or less,
and so there are no comparable corpora (10-15 billion words) in the CQPWeb
format. Preliminary results from very small corpora like the BNC, however, show
that Sketch Engine is much faster than CQPWeb. So by extension, since
English-Corpora.org is 10-15 times as fast as Sketch Engine, it is probably
30-40 times as fast as CQPWeb.
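The timing procedure in note 2 can be sketched as a small calculation. This is one reading of that note, not a documented method: it assumes the two waits (until the KWIC results appear, and until the frequency list appears) were timed separately and added together. The function name, its parameters, and the 2.5-second default (the midpoint of "about 2-3 seconds") are assumptions made for illustration.

```python
def sketch_engine_seconds(kwic_seconds, frequency_seconds, click_overhead_seconds=2.5):
    """Combine the two measured waits and subtract the manual click overhead.

    kwic_seconds: wait until the KWIC results start displaying.
    frequency_seconds: wait until the frequency list is produced,
        including the time spent clicking the [Frequency] icon.
    click_overhead_seconds: assumed allowance for that click.
    """
    combined = kwic_seconds + frequency_seconds
    if combined < click_overhead_seconds:
        raise ValueError("measured time is shorter than the click overhead")
    return combined - click_overhead_seconds


if __name__ == "__main__":
    # Hypothetical stopwatch readings, not values taken from the table.
    print(sketch_engine_seconds(10.0, 17.5))
```

Because the click allowance is only approximate, each SE time carries an uncertainty of roughly a second, which is small relative to most of the SE timings in the table.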