At English-Corpora.org, we're introducing a new way to interact with corpus data.
Using Large Language Models (LLMs) like GPT, Gemini, and Claude, users will soon be able to have collocates, phrases, and frequency data
clustered, categorized, and explained automatically.
The underlying corpus data remains unchanged -- but AI will provide an optional layer of analysis to help users spot patterns and connections more quickly. Corpora first. AI second. Insights faster.
The NOW corpus (News on the Web)
was created by Mark Davies, and it
contains 21.7 billion words of data from web-based newspapers and magazines from 2010
to the present time (the most recent day is 2022-11-10). More importantly, the corpus grows by about 270-290 million
words of data each month (from about 470,000 new articles), or about 3.1 billion words each year.
While other resources like Google Trends
show you what people are searching for, the NOW Corpus is the only structured corpus that shows you what is actually
happening in the language -- virtually right up to the present time. For example,
see the
frequency of words since 2010, as well as
new words and phrases from the last few years.
In this sense, NOW is the most robust
monitor corpus of English.
Click on any of the links in the search form to the left (such as
List or
Chart) for
context-sensitive help, and to see the range of queries that the corpus
offers. You might pay special attention to the
comparisons between
dates and countries and virtual corpora,
which allow you to create personalized collections of texts based on (sub-)register, website, and even words in the web pages.
And you might want to check out the new expanded help files.
Find single words like
mysterious,
all forms of a word like JUMP,
words matching patterns like
*break*,
phrases like more * than or
rough NOUN.
You can also search by synonyms (e.g.
gorgeous), and
customized wordlists like
clothes. In each case, you see each individual
matching string.
An easy way to use part of speech tags is by selecting them from the drop-down
list (click on [PoS] to show it). You can also type the part of speech tags
directly into the search form.
Previously, you had to use the part of speech tag (from the link above) inside
of brackets, e.g. [j*]. But that's a bit cumbersome for mobile phones, and there
are now different ways of specifying the part of speech -- all of which work
equally well. For example, all of the following would find the same strings:
ADJ eyes,
[j*] eyes,
J eyes,
_j eyes.
Type 1     Type 2       Type 3         Type 4
Original   New (word)   New (abbrev)   CQP-like   Explanation                      Example
[nn*]      NOUN         N              _nn        Common nouns                     sun, love
[np*]      NAME         NP             _np        Proper nouns                     John, Chicago
[n*]       NOUN+        N+             _n         Common and proper nouns          sun, Sonny
[vv*]      VERB         V              _vv        Lexical verb (no do, be, have)   decide, jumped
[v*]       VERB+        V+             _v         All verbs (incl do, be, have)    decide, has, is
[j*]       ADJ          J              _j         Adjectives                       nice, clean
[r*]       ADV          R              _r         Adverbs                          soon, quickly
[p*]       PRON         (none)         _p         Pronouns                         she, everyone
[i*]       PREP         (none)         _i         Prepositions                     from, on
[a*]       ART          (none)         _a         Articles                         the, his
[d*]       DET          (none)         _d         Determiners                      these, all
[c*]       CONJ         (none)         _c         Conjunctions                     that, and, or
[x*]       NEG          (none)         _x         Negation                         not, n't
[m*]       NUM          (none)         _m         Numbers                          five, 5
All other parts of speech: use Type 1 or Type 4, e.g. [nn2*], _nn2, [cst*], _cst
If you are using Type 1 or Type 4 above, you can use wildcards for the part of speech tag. For example,
[nn2*] = plural nouns,
[n*] = all nouns,
[*n*] = nouns
(including ambiguous noun/adj tags), etc. If you are using Type 2 or Type 3, it needs to be upper case:
short NOUN.
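To see how the four notations relate, here is a minimal Python sketch that rewrites the upper-case Type 2 and Type 3 tags in a query as Type 1 or Type 4 tags. The mapping is copied from the table above; the function names (to_bracket, to_underscore) are hypothetical and are not part of the corpus interface, which does this conversion for you.

```python
# Type 2 word -> tag stem, copied from the table above.
POS_WORDS = {
    "NOUN": "nn", "NAME": "np", "NOUN+": "n",
    "VERB": "vv", "VERB+": "v",
    "ADJ": "j", "ADV": "r", "PRON": "p", "PREP": "i",
    "ART": "a", "DET": "d", "CONJ": "c", "NEG": "x", "NUM": "m",
}
# Type 3 abbreviations exist only for the first seven rows of the table.
POS_ABBREVS = {"N": "nn", "NP": "np", "N+": "n", "V": "vv", "V+": "v", "J": "j", "R": "r"}


def _stem(token):
    """Return the tag stem for an upper-case Type 2/3 tag, or None for anything else."""
    return POS_WORDS.get(token) or POS_ABBREVS.get(token)


def to_bracket(query):
    """Rewrite Type 2/3 tags as Type 1 tags, e.g. ADJ -> [j*]."""
    return " ".join(f"[{_stem(t)}*]" if _stem(t) else t for t in query.split())


def to_underscore(query):
    """Rewrite Type 2/3 tags as Type 4 tags, e.g. ADJ -> _j."""
    return " ".join(f"_{_stem(t)}" if _stem(t) else t for t in query.split())


print(to_bracket("ADJ eyes"))      # [j*] eyes
print(to_underscore("J eyes"))     # _j eyes
print(to_bracket("more * than"))   # more * than (no tags, so unchanged)
```

Because the lookup is case-sensitive, a lower-case word such as noun is left alone, which mirrors the rule that Type 2 and Type 3 tags need to be upper case.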
You can also add a part of speech tag to the end of any word,
but you need to use either Type 1 or Type 4 above. For example,
end would find end with any part of speech, but
end.[n*]
or
end_n would limit it
to end as a noun, and
end.[v*]
or end_v would limit it
to end as a verb. Make sure that you separate the word and the part of speech with a period / full stop
and bracket (Type 1) or an underscore (Type 4), and that there is no space.
Remember also that you can combine these with lemma searches to find all forms
of a word with a given part of speech, e.g.
END_v or
END.[v*].
If you don't know what the part of
speech tag is for a given word (or the words in a phrase), just
select [OPTIONS] and then [GROUP BY] = [NONE] (SHOW POS). For
example, see the PoS tags for
light,
back,
front, or
in light of
If you capitalize an entire word, it will find all forms of that word. For example,
DECIDE would find
all forms of decide (decide, decides, decided, deciding), whereas
decide would just find
the single form decide.
Another example: =CLEAN_v
would find all forms of all of the
synonyms of clean (scour, scoured, polish, polishes, etc), whereas
=clean_v would just find
scour, polish, etc. (Notice that we have also added the
part of speech tag _v to the end, to limit these to verbs.)
You can search by all of the synonyms of a given word, which provides powerful "semantically-based" searches of the corpus. For example, you can find
the synonyms of
beautiful,
nonsense, or
clean (v).
Of course you can use the synonyms as part of phrases as well. For example,
=CLEAN * NOUN,
=clever =man, or
=strong ARGUMENT.
As the last example shows, synonyms can be very useful when you are a non-native speaker, and you want to know which related words are used in a particular context.
As =clean * NOUN shows, not
every token will actually be a synonym of a given word in every case. For example, scour may be a synonym of clean in scour the sink, but
not in scour the library for good books.
Note that it is often useful to limit the synonyms to those with a particular part of speech, as in
=clean_v. It is
often also useful to find all forms of the synonyms, by capitalizing the word:
=CLEAN. And of course
you can combine these as well, for example
all forms of all synonyms of clean as a verb.
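The three markers described so far (capitals for all forms, a leading = for synonyms, and a part of speech tag after an underscore or a period and bracket) can be combined on one word. The following Python sketch shows one way to read them back out of a single search term. The Token class and parse_token function are hypothetical illustrations, not the parser the site uses, and the sketch only handles word terms (not bare tags like NOUN or wildcards).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    word: str            # the word itself, lower-cased
    all_forms: bool      # True if the whole word was capitalized, e.g. DECIDE
    synonyms: bool       # True if the term started with "=", e.g. =clean
    pos: Optional[str]   # tag after "_" or inside ".[...]", e.g. "v" or "n*"


def parse_token(text):
    """Split one search term into word, all-forms flag, synonym flag, and PoS tag."""
    synonyms = text.startswith("=")
    body = text[1:] if synonyms else text
    pos = None
    if body.endswith("]") and ".[" in body:   # Type 1 suffix: end.[n*]
        body, pos = body[:-1].split(".[", 1)
    elif "_" in body:                         # Type 4 suffix: end_n
        body, pos = body.rsplit("_", 1)
    all_forms = body.isalpha() and body.isupper()
    return Token(body.lower(), all_forms, synonyms, pos)


print(parse_token("=CLEAN_v"))
print(parse_token("end.[n*]"))
```

For =CLEAN_v all three markers are set: synonyms of clean, all forms of each synonym, verbs only.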
Finally, note that you can click on the [S] to find synonyms for each word in the results set. This allows you to follow a "synonym chain" from one word to another to another...
The Hansard and EEBO corpora have been "semantically tagged", and you can use these tags as part of your search. A few examples are given below.
"The mind: Idealism" + [all
forms of all synonyms of idea]
abstract ideas, implicite belief
"User lists" or "customized lists" are word lists that you create -- related to a certain
topic (e.g. sports, clothing, or emotions), words that are grammatically related (e.g. a
certain subset of adverbs or pronouns), or any other listing that you
might want. For example,
click here to run a query based on two sample word lists that we
created -- one with a list of colors, and the other with a short list of
parts of clothing.
You can later view the lists that you have created, and modify
the wordlist (add or delete words), or delete a list entirely.
Once
created, you can re-use a wordlist in queries at any time in the future
-- they remain stored in the database on the server. The easiest way to
include a list in the main search window is to just select it in the
wordlist window. If desired, you can also type it into the search form
directly. The format is:
@listName
e.g.:
@foods
@emotions
You can also use the list as part of a phrase:
was quite @beautiful
LIKE_v playing @sports
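Conceptually, a @listName term stands for "any one of the words in that list". The Python sketch below illustrates the idea by replacing each @listName with a word1|word2 alternation (the same | notation used elsewhere in these help pages). The list contents and the expand_lists function are made up for illustration; the real lists are stored on the server and expanded there.

```python
# Hypothetical word lists; on the site these are stored with your account.
WORDLISTS = {
    "colors": ["red", "green", "blue"],
    "clothes": ["shirt", "socks", "hat"],
}


def expand_lists(query, wordlists):
    """Replace each @listName term with word1|word2|... (no spaces around |)."""
    out = []
    for token in query.split():
        if token.startswith("@"):
            name = token[1:]
            if name not in wordlists:
                raise KeyError(f"unknown word list: {name}")
            out.append("|".join(wordlists[name]))
        else:
            out.append(token)
    return " ".join(out)


print(expand_lists("@colors @clothes", WORDLISTS))
# red|green|blue shirt|socks|hat
```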
3. Select and de-select words from the list by clicking in the checkbox
to the left of each word. Only the words that you select will be
saved to the list. You can use the checkbox to the left of the
[CONTEXT] button to select or de-select the entire list.
4. Enter the name you want to
give to the list (in this case, maybe beautiful-syn).
5. Make sure you really have selected some words (step 2 above), and then click [Submit] to save your list.
6. If you want, select the list that you've saved in the
customized wordlists interface. You can add to the list, modify
entries (click M), or delete words from the list.
7. Finally, you can then re-use this list as part of subsequent queries.
For example, if [mark_davies@byu.edu] has created and stored the list [beautiful-syn]
then he could find cases of
was ADV followed by one of these adjectives.
Many of the examples shown in the other sections are for individual words. But you can combine the different types of searches to create fairly complex phrases. For example:
would find strings with PUT at the beginning and away at the end,
with up to three words between, at least one of which has to be a NOUN. In other
words, it would do the following seven searches, one right after another, and
would then display the results for all of the searches on one page.
     Searches (done one right after another)   Matching strings
1    PUT away                                  put away (no words in between)
2    PUT NOUN away                             put toys away
3    PUT * NOUN away                           put the toys away
4    PUT NOUN * away                           put toys far away
5    PUT * * NOUN away                         put the fun toys away
6    PUT * NOUN * away                         put the toys far away
7    PUT NOUN * * away                         put toys and crayons away
In terms of search syntax, note that:
1. {n} indicates the number of words (0 to n) that
can be in this "variable length" string. Valid numbers are 1, 2, or 3 (in other
words, the longest variable length string is three words)
2. If you don't indicate {n} -- for example (NOUN) -- then it would be just one word -- meaning that it will be either that one word
or nothing
3. Any "slot" without parentheses around it is
obligatory. For example, put * away would not match put away,
since * doesn't have parentheses around it.
4. You can't include multiple "flex" operators in a
search. For example, they (VERB+){2} notice (NOUN){3} would not be
possible.
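The way one flex slot turns into a series of fixed-length searches can be sketched in Python as follows. This is only an illustration of the rules above, not the site's implementation. It assumes the query is written as PUT (NOUN){3} away (the help text does not show the exact query string, so that spelling is a guess based on the syntax notes), and the function name expand_flex is hypothetical.

```python
import re

# A flex slot looks like (TAG) or (TAG){n}.
FLEX = re.compile(r"^\((.+)\)(?:\{(\d+)\})?$")


def expand_flex(query):
    """Expand a query with at most one flex slot into its fixed-length searches."""
    tokens = query.split()
    slots = [i for i, t in enumerate(tokens) if FLEX.match(t)]
    if len(slots) > 1:
        raise ValueError("only one flex slot is allowed per search")
    if not slots:
        return [query]
    i = slots[0]
    match = FLEX.match(tokens[i])
    tag = match.group(1)
    n = int(match.group(2) or 1)          # (TAG) without {n} means 0 or 1 word
    if not 1 <= n <= 3:
        raise ValueError("{n} must be 1, 2, or 3")
    fillers = [[]]                        # the slot may be empty
    for length in range(1, n + 1):
        # Exactly one position holds the tag; the other positions are wildcards.
        for pos in reversed(range(length)):
            filler = ["*"] * length
            filler[pos] = tag
            fillers.append(filler)
    return [" ".join(tokens[:i] + f + tokens[i + 1:]) for f in fillers]


for search in expand_flex("PUT (NOUN){3} away"):
    print(search)
```

Under these assumptions the output is the same seven searches, in the same order, as in the table above. A slot with {n} therefore costs 1 + n(n+1)/2 searches, which is why flex searches are rationed.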
The following are some additional searches. They produce interesting results in the one billion word COCA corpus,
but the results in other corpora may not be as good. In each case, we show a few sample matching strings, and some strings that would not be generated
by the search (and why not).
take it away (it from *, which is not optional; no other words from {2}, since 0-2 words)
take the money away (the from *, money (one slot) from {2})
take even more money away (even from *, more money (two slots) from {2})
take away (* forces at least one word)
take it quickly away (no NOUN)
take even more easy money away (more easy money = 3 words)
I was noticing
I had never even noticed (VERB+ matches any verb, including do, be, have; VERB is only lexical verbs)
I sometimes notice (no VERB+)
I had never even ever noticed (4 words; max of 3)
Some additional notes:
1. Because a "flex search" may involve up to seven
different searches (see above), there are some limits on the number of flex
searches in a given 24 hour period. For those who do not have a
premium or
academic
license, there is a limit of five flex searches in 24 hours. Those who do have a
license can do up to 50 flex searches in a 24 hour period.
2. Again, because of the number of searches that
are done in a flex search, it would take a long time to do these searches if all
of the "slots" are high frequency. This can be a real limitation in very large
corpora like NOW (19+ billion
words) or iWeb (14 billion
words). So a search like HAVE (ADJ){3} time probably won't work in those
corpora -- HAVE and time are too high in frequency. In a case like
this, you will probably need to do these as a series of separate searches --
HAVE time, HAVE * time, HAVE * ADJ time, etc. But again, this should not be a
problem with a small corpus like the
BNC.
CHART display
If you are interested in a set of words or a grammatical construction, then the LIST option
shows the frequency of each matching form (end up being, ended up saying, etc), while the
CHART option
shows the total frequency in each section
(in NOW, the years). It is also possible to see the frequency of words and phrases by "week", e.g.
fidget spinner, or
fake news, or
pandemic.
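The difference between the two displays is just a difference in what the hits are grouped by. A small Python sketch with made-up hits (the data and the function names list_view and chart_view are hypothetical) shows the idea:

```python
from collections import Counter

# Hypothetical hits for a query such as END up VERB: (matching string, year).
hits = [
    ("end up being", 2019),
    ("ended up saying", 2019),
    ("end up being", 2020),
    ("ended up being", 2020),
    ("end up being", 2020),
]


def list_view(hits):
    """LIST: one frequency per matching form."""
    return Counter(form for form, _ in hits)


def chart_view(hits):
    """CHART: one total frequency per section (here, per year)."""
    return Counter(year for _, year in hits)


print(list_view(hits).most_common())
print(sorted(chart_view(hits).items()))
```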
Because the NOW corpus is so large and because it is constantly growing, collocates work differently in NOW than in other corpora.
If you follow the guidelines below, searches should be very fast. If you depart from any of them, the searches may be very slow or even "time out".
Guideline
Good / fast
Bad / slow
Search by lemmas (capitalize to find all word forms; you can still limit by part of speech, e.g. _v)
The collocates search finds words near another word (i.e. within a "cloud" of nearby words), whereas the
LIST search finds an exact string of words.
For example:
nouns
near taste would include mouth (taste in his mouth), smell (sense of taste and smell), and matter (a matter of taste). But
because none of these collocates are immediately adjacent to taste they would not be found with the LIST search
NOUN taste or
taste NOUN.
On the other hand, if you want adjectives with taste, the LIST search
ADJ taste might be the
best search, and the collocates search of
adjectives "near" taste might
not add very much. Note that LIST searches are always much faster as well.
For both the WORD and COLLOCATES field, you can include the full range of searches, including words, lemmas, substrings, parts of speech, and synonyms.
For example, the following are
searches for collocates of gap (n):
any word,
nouns,
adjectives,
the word fill,
synonyms of large.
Select the "span" (number of words to the left and the right) for the
collocates. Use + to search more than four words to the left or right, and 0 to
exclude the words to the left or right. If you don't select a span, it will
default to 4 words left and 4 words right.
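What the span means can be shown with a few lines of Python. This is a simplified sketch, not how the corpus computes collocates: it works on a plain list of words, ignores sentence boundaries and part of speech, and the function name collocates is hypothetical.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count the words within `left` words before and `right` words after each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


tokens = "it left a bad taste in his mouth".split()
print(collocates(tokens, "taste"))                    # default span: 4 left, 4 right
print(collocates(tokens, "taste", left=0, right=3))   # 0 excludes the left side
```

With the default span, mouth is counted as a collocate of taste even though it is three words away, which is exactly what a LIST search for taste NOUN would miss.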
You can use collocates to do "variable length" searches, where there might be 0-4 (or more) words between two other sets of words or phrases.
For example, you could find all of the following with one simple search.
(were) talked --- into coming (0 words)
talk them into coming (1 word)
talk the girls into coming (2 words)
talk some other people into coming (3 words)
talk lots of other people into coming (4 words)
In the sample queries below, you would
enter the following in WORD(S),
COLLOCATES,
and
the maximum
length in words (up to nine words, left and right) between WORD(S) and COLLOCATES. For example, 0 L | 4 R means the
COLLOCATES are between 0 words to the left and 4 words to the right of
WORD(S).
Click on
A ,
B , or
C
below to run the sample queries.
what|all he wants to do BE VERB
e.g. what|all he wants to do is complain
5 L | 0 R
what|all they expected Fred to do BE VERB
7 L | 0 R
what|all any of these crazy people can do BE VERB
8 L | 0 R
what|all your best friend can possibly hope to do BE VERB
Note
Use [a*]|[d*]|[n*]|[p*]
to look for the first word of a noun phrase (you may want to refine
this further). You can also use the negator
- to indicate NOT,
e.g. -VERB|ADV
(not verb or adverb) or -to|will|would
(none of these three words). Make sure there is no space to the left or
right of | when there
is a series of elements.
Notes:
1. Not all
of the KWIC entries will in fact be relevant, because we haven't placed
any constraints on what is between the yellow and the green parts of the
search. But using the yellow portion as an "anchor" is still far better
than searching for just the green portion.
2. The green (collocates) portion can only have one word, not a sequence of
two or three words. For this one word, however, there can be any number
of possibilities, such as either what or all in [B] above.
3. Another option is to do a
variable
length phrase/sequence search. The advantage of that approach is
that you can see (and limit) the intervening words. The disadvantage is
that the "variable length" section is limited to three words.
By comparing collocates, you can move far beyond the simplistic entries in a thesaurus to "tease out" slight differences between words, or (as in the case of
boy and girl) to see the differences in what is being said about two different things.
Please review the discussion of collocates to see how to select the span for the collocates.
Select the words that you want to sort with. Select L for 1, 2, and 3 words to the left. Select R for 1, 2, and 3 words to the
right. You could
also, for example, sort by one word to the left, then one and two words to the
right. Click * to clear the entries and start over.
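Sorting concordance lines by the words around the search word can be sketched as below. This is an illustration only: the line format (left words, search word, right words) and the function name sort_kwic are assumptions, and keys such as L1 or R2 are just a compact way of writing "one word to the left" or "two words to the right".

```python
def sort_kwic(lines, keys):
    """Sort KWIC lines by the given positions, e.g. ["L1", "R1", "R2"].

    Each line is (left_words, node, right_words), with words in reading order.
    """
    def word_at(line, key):
        left, _, right = line
        side, distance = key[0], int(key[1:])
        words = left[::-1] if side == "L" else right   # nearest word first
        return words[distance - 1].lower() if distance <= len(words) else ""

    return sorted(lines, key=lambda line: tuple(word_at(line, k) for k in keys))


lines = [
    (["a", "matter", "of"], "taste", ["and", "style"]),
    (["sense", "of"], "taste", ["and", "smell"]),
    (["a", "bad"], "taste", ["in", "his"]),
]
# One word to the left first, then one and two words to the right.
for left, node, right in sort_kwic(lines, ["L1", "R1", "R2"]):
    print(" ".join(left), "[" + node + "]", " ".join(right))
```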
websites use that word the most (can use these to create Virtual Corpora).
You can find a wealth of information for the top 40,000 words in the corpus, including:
definitions and synonyms (including links to external dictionaries) and links to external images and videos
frequency information, including frequency by genre and country
collocates (nearby words), which provides insight into meaning and usage
topics (co-occurring words anywhere on the webpage), which provide perhaps even better insight into meaning
concordance lines, to see the patterns in which a word occurs
SECTIONS
SHOW Determines whether the frequency is shown for each "section" of the corpus
(in the case of NOW, the date).
For example, the
synonyms of beautiful in
each section and
overall.
Select a time period (e.g. 2015 or 2015-2016) or country (e.g. US, GB, IN(dia) )
# KWIC is the number of results for a KWIC (concordances) search.
GROUP BY determines whether words are grouped by word form (e.g. decide and decided separately), lemma (e.g. all forms of
decide together), and whether you see the part of speech for each word (e.g. beat
as a noun and verb displayed separately).
SHOW # TEXTS determines whether you see the number of texts in which a word or phrase occurs, in addition to its frequency. This can be useful
in finding words and phrases that are limited just to a few texts in the corpus.
(More information)
CASE SENSITIVE determines whether She thought and she thought would be two different searches, or
The Office, the Office, and the office.
DISPLAY shows raw frequency, occurrences per million words, or a combination of these.
SAVE LISTS allows you to create a wordlist from the results and then re-use it later in your searches.
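Occurrences per million words matter because the sections of the corpus are not the same size, so raw frequencies cannot be compared directly. A minimal Python sketch, with made-up section sizes and a hypothetical function name:

```python
def per_million(raw_frequency, section_size):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_frequency * 1_000_000 / section_size


# Hypothetical numbers: the word is rarer in raw terms in the second section,
# but more frequent once section size is taken into account.
print(per_million(500, 2_000_000_000))   # 0.25
print(per_million(300, 500_000_000))     # 0.6
```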
It is often useful to specify the minimum frequency when you are sorting by "relevance", to eliminate
very low frequency strings. For example, collocates of green where
minimum frequency = 1 (strange once-off strings) and where
minimum frequency = 20.
Note also that when you do a collocates search and you don't specify anything for the collocates field, it will automatically set
MINIMUM to MUT INFO = 3 (Mutual Information score). It does this to remove high frequency noise words like the, to, with, etc. If you want to see
more of these words, lower the MI score; to see fewer, increase it.
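To see why an MI threshold removes words like the, here is a sketch using the textbook (pointwise) Mutual Information formula. This help page does not give the exact formula that the corpus uses (it may, for example, also adjust for the size of the span), so treat the function below as an approximation with hypothetical names and made-up frequencies.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Textbook pointwise MI: log2 of observed over expected co-occurrence."""
    return math.log2(pair_freq * corpus_size / (node_freq * collocate_freq))


CORPUS_SIZE = 1_000_000        # hypothetical corpus of one million words
NODE_FREQ = 200                # hypothetical frequency of the search word

# collocate -> (frequency near the search word, frequency in the whole corpus)
candidates = {"the": (40, 60_000), "lush": (12, 50)}

for word, (pair_freq, coll_freq) in candidates.items():
    mi = mutual_information(pair_freq, NODE_FREQ, coll_freq, CORPUS_SIZE)
    print(word, round(mi, 2), "kept" if mi >= 3 else "filtered out")
```

In this made-up example the occurs near the search word more often than lush does, but it is so common everywhere that its MI is low, so a minimum of 3 filters it out.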
Create a "virtual corpus" -- essentially your own personalized corpus within NOW. You can create the corpus either by
keywords in the texts (e.g. texts with the words investments, basketball, or biology), or
information about the texts (e.g.
date, title, or source), or a combination of keyword and text information.
You can then
edit your virtual corpora,
search within a particular virtual corpus,
compare the frequency of a word, phrase or grammatical construction in your different virtual corpora, and also
create "keyword lists" based on the texts in your virtual corpus.
Click on any of the links above for more information.
To create a virtual corpus by keywords, enter a word or phrase to the left, and then set
TEXTS/VIRTUAL to FIND TEXTS (you must be logged in first).
You might also want to set SORT/LIMIT to RELEVANCE and MINIMUM FREQUENCY to something like 5 (the minimum number of times you want the word
to occur in a text).
After clicking SUBMIT, you will see a list of matching texts from the corpus. For example, see matching texts for
investment*,
rocket, or
electron.
On the "results" page, choose how many texts you want in your virtual corpus, and then click SAVE LIST.
After the virtual corpus is created, you might
want to click on FIND KEYWORDS to see whether the corpus is providing the focus that you want.
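The keyword approach amounts to keeping the texts in which the word occurs at least MINIMUM FREQUENCY times, and then taking as many of the top texts as you want. The Python sketch below illustrates this with a hypothetical find_texts function. It ranks texts by raw count, which is an assumption: the help text does not say how RELEVANCE is calculated.

```python
def find_texts(texts, keyword, minimum_frequency=5, limit=None):
    """Return (text_id, count) for texts with enough hits, most hits first.

    `texts` maps a text id to its list of words.
    """
    hits = []
    for text_id, tokens in texts.items():
        count = sum(1 for t in tokens if t.lower() == keyword.lower())
        if count >= minimum_frequency:
            hits.append((text_id, count))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:limit]   # limit=None keeps every matching text


# Hypothetical mini-corpus.
texts = {
    "text-a": ["rocket"] * 6 + ["launch"],
    "text-b": ["rocket"] * 2 + ["salad"] * 10,
    "text-c": ["Rocket"] * 9,
}
print(find_texts(texts, "rocket"))            # text-b is dropped (only 2 hits)
print(find_texts(texts, "rocket", limit=1))   # keep just the top text
```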
You can create a virtual corpus by selecting texts that match certain criteria -- such as
title of the source (e.g. New York Times) or the title of the article, the topic, the date, and so on.
Click on CREATE CORPUS to the left to see the interface to
select the texts.
As an example, this list was created by searching for articles dealing with refugees in Europe, from Sep 2015.
Note that in that search form, you can also make sure that the texts have certain words in them. If you want more control in finding texts with
certain words, you might want to search by keywords.
See the list that was created by searching NOW for articles dealing with refugees in Europe, from Sep 2015.
Explanation: You can add to or delete texts from your virtual corpus, or move texts from one virtual corpus to another. You can also rename and delete corpora, and
temporarily "ignore" corpora (for example, when you're comparing corpora). Finally, you can arrange virtual corpora into user-defined categories
(science, religion, sports, etc).
You can see what words occur much more in a particular virtual corpus than in the corpus overall. For example,
see the keywords from the virtual corpus that is composed of articles dealing with refugees in Europe, from Sep 2015.
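One simple way to find words that "occur much more" in a virtual corpus is to compare each word's relative frequency there with its relative frequency in the corpus overall. The sketch below does this with a plain ratio. The help text does not say which measure the site actually uses, so the function name, the add-one smoothing, and the data are all assumptions for illustration.

```python
from collections import Counter


def keywords(virtual_tokens, reference_tokens, min_freq=2):
    """Rank words by (rate in the virtual corpus) / (rate in the reference corpus)."""
    v_counts, r_counts = Counter(virtual_tokens), Counter(reference_tokens)
    v_size, r_size = len(virtual_tokens), len(reference_tokens)
    scores = {}
    for word, freq in v_counts.items():
        if freq < min_freq:
            continue
        v_rate = freq / v_size
        # Add-one smoothing so words missing from the reference do not divide by zero.
        r_rate = (r_counts[word] + 1) / (r_size + 1)
        scores[word] = v_rate / r_rate
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical token lists.
virtual = "asylum seekers asylum the the border".split()
reference = ("the " * 50 + "asylum border market " * 2).split()
print(keywords(virtual, reference))   # ['asylum', 'the']
```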
Once you have created a virtual corpus (by keyword or by
text metadata), then you can search that set of texts as though it were its own corpus.
You can search for matching strings, collocates (nearby words), and retrieve re-sortable concordance (KWIC) lines.
To search one of the corpora, just select it from your list of virtual corpora, and then fill out the rest of the search form as you normally would.
For example, you can search for the word asylum in the
refugees virtual corpus. (Click on the word in the results list, and
you will see that all of the occurrences are from your virtual corpus.)
If you have created multiple virtual corpora, then you can compare the frequency of a word, phrase, or grammatical construction
in these different corpora. Just enter the word or phrase in the search form (as you would do normally), and then select MY CORPORA
(try it; must be logged in first).