English-Corpora.org




Sorting / limiting results

Note: click on any link on this page to see the corpus data, and then click on the "BACK" image (see left) at the top of the page to come back to this page. Or right click on the link and then "Open link in new tab" (in Chrome; similar in other browsers), and then close that tab after viewing the corpus data.

Limiting entries

At the most basic level, you can set a lower limit for the frequency of the results. For example, take the search STOP PRON _vvg in the GloWbE corpus (for example stop it happening, where it would be stop it from happening in the US/CA). If you don't set a lower limit, you will see all of the entries shown in the list below. But if you set a minimum frequency of 40, you will see just the first seven entries. This is most useful when you're seeing very low frequency entries that don't seem to relate much to the search that you've done.
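As a rough illustration of what a minimum frequency does, here is a minimal Python sketch. It is not the corpus interface itself: the result strings and counts are invented, and the function name apply_min_frequency is hypothetical.

```python
# Hypothetical (string, frequency) results; strings and counts are invented for illustration.
results = [
    ("stop it happening", 95),
    ("stop them coming", 41),
    ("stop me going", 12),
    ("stop us seeing", 3),
]

def apply_min_frequency(rows, min_freq=None):
    """Keep rows whose frequency is at least min_freq; None means no lower limit."""
    if min_freq is None:
        return list(rows)
    return [(string, freq) for string, freq in rows if freq >= min_freq]

# With a minimum frequency of 40, only the two most frequent strings remain.
filtered = apply_min_frequency(results, min_freq=40)
print(filtered)
```

Leaving min_freq unset returns every entry, including the one- and two-token strings at the bottom of the list.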

Sorting entries: phrase

For most searches, the default sorting is by frequency. For example, if you search for ADJ smile, you will see that the most frequent strings are big, little, small + smile. But big, little, and small occur with lots of nouns, not just smile. In other words, they don't really tell us much about smile.

Frequency

But if we sort by relevance, then we see the words that are especially "attracted" to smile. For example, rueful occurs with smile 108 times in the COCA corpus (1 below) and rueful occurs only 524 times overall in the corpus (2). In other words, 20.6% (3) of all of the tokens of rueful in the corpus occur with smile, and so if someone said the word rueful, native speakers might easily think of smile. The Mutual Information score (4), which is directly tied to the percentage (3), is 9.24, which is quite high. And then (5) provides a bar chart that shows the relative frequency of each of the phrases.

Relevance (Percentage / MutInfo)
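The arithmetic behind the percentage can be checked in a few lines of Python. The sketch below also shows why a Mutual Information score moves together with the percentage, using the textbook pointwise mutual information formula. That formula is an assumption here: the corpus may compute its score differently (for example with a span adjustment), and the overall frequency of smile is a made-up placeholder, so the printed score is not expected to match 9.24.

```python
import math

CORPUS_SIZE = 1_000_000_000        # COCA is described on this page as one billion words
HYPOTHETICAL_SMILE_FREQ = 50_000   # placeholder: the real frequency of "smile" is not given here

def percentage(pair_freq, word_freq):
    """Share (in percent) of a word's tokens that occur with the other word."""
    return 100.0 * pair_freq / word_freq

def pointwise_mi(pair_freq, word_freq, node_freq, corpus_size):
    """Textbook pointwise mutual information in bits (assumed formula, not the site's)."""
    return math.log2(pair_freq * corpus_size / (word_freq * node_freq))

rueful_pct = percentage(108, 524)
print(round(rueful_pct, 1))  # 20.6

# For a fixed node word and corpus size, the score depends only on the percentage.
score = pointwise_mi(108, 524, HYPOTHETICAL_SMILE_FREQ, CORPUS_SIZE)
same_score = math.log2(rueful_pct / 100 * CORPUS_SIZE / HYPOTHETICAL_SMILE_FREQ)
print(abs(score - same_score) < 1e-9)  # True
```

Because the node word and corpus size are the same for every row in a results list, ranking by percentage and ranking by this score give the same order.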


But you need to be careful when using the "relevance" (percentage / Mutual Information) sorting. If you don't click the checkbox next to [Minimum Frequency], then you will probably get some strange, very low frequency phrases. They have a high relevance (Mutual Information) value, in the sense that a high percentage of all of the tokens of a given word (for example cracked-open, below) do occur with the other word (for example smile). In this case, 20% (2 of 10) of the tokens of cracked-open occur right before smile. But two tokens is a very small number in a corpus like COCA, which has one billion words. By default, when you click on [Sorting: Relevance], it will select that checkbox and set a good minimum value for the corpus that you're using (with higher values for larger corpora). You can adjust this if you want: a higher number for fewer low-frequency strings, a lower number for more low-frequency strings. But again, if you set it too low (or don't click the checkbox at all), you may get some strange, low-frequency strings.

Relevance (Perc/MI), no frequency filter
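The effect of the frequency filter on relevance sorting can be sketched as follows. The counts for rueful and cracked-open come from the examples on this page. The row for big, the threshold of 5, and the function name sort_by_relevance are assumptions made for illustration only.

```python
# (word, frequency with "smile", overall frequency of the word)
rows = [
    ("rueful", 108, 524),        # counts from the text above
    ("cracked-open", 2, 10),     # counts from the text above
    ("big", 1500, 300_000),      # invented counts for a generally frequent adjective
]

def sort_by_relevance(rows, min_freq=None):
    """Sort by the share of a word's tokens that occur with the node word.

    min_freq is a lower limit on the pair frequency; None means no filter.
    """
    kept = [r for r in rows if min_freq is None or r[1] >= min_freq]
    return sorted(kept, key=lambda r: r[1] / r[2], reverse=True)

unfiltered = [word for word, _, _ in sort_by_relevance(rows)]
filtered = [word for word, _, _ in sort_by_relevance(rows, min_freq=5)]

print(unfiltered)  # ['rueful', 'cracked-open', 'big']
print(filtered)    # ['rueful', 'big']
```

Without the filter, a string with only two tokens ranks almost as high as rueful. With the filter, it drops out and the ranking rests on better-attested strings.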

Limiting and sorting with collocates

Above we were looking at strings with a certain number of words, for example the two word string ADJ smile. You can also search for collocates, which are words occurring anywhere within a "cloud of words" around a node word -- for example, ADJ (adjectives) anywhere within a cloud of words from 4 words left to 4 words right of the word car.

With collocates, if you don't manually select anything (including the collocate, such as ADJ), then it will default to [Frequency] with Relevance (Mutual Information) as a filter. What this means is that it will filter out very high frequency words like the, are, and, or thing. These words might be found a lot near your node word (in this case, car), but they are found in many other contexts as well, meaning that they aren't especially related to your node word.

If you set the Mutual Information score to a lower number (such as 1.0), it will show more of these generally high frequency words in English, such as small or use with telescope. If you set it higher (for example 7.0), it will show those words that are especially related to the node word -- but at the risk of not showing collocates that provide basic meaning about the word, such as space or sky with telescope. A value of 3.0 is probably a good starting point.

If you want to focus even more on words that are highly correlated with a given word (but again, missing out on more basic words, like space with telescope), then sort by Relevance (which means Mutual Information), as is shown in the image to the left. Mutual Information is basically a measure of how much we can predict the occurrence of a collocate, given a certain node word. For example, given the word wreak, there is a high probability of finding havoc nearby.

But again, if you do use Relevance / Mutual Information, it is important that you set the minimum frequency. The corpus will suggest a value (like 3 or 5 or 10, or for very large corpora like iWeb or NOW maybe even 20 or 50), and it's probably good to at least start with that value. Very briefly, the reason is that a very low frequency word might occur only a few times in the entire corpus. And almost by chance, in that handful of occurrences, it happens to be near your node word. For example, see strange collocates of sunset like out-house, Uluwatu, or 38-F, which have a frequency of just 1 or 2 tokens and which have a high Mutual Information value, but which really don't tell us anything about the meaning of sunset. So the Mutual Information score will suggest that the two words are highly correlated with each other, when they really aren't. If you are sorting by Relevance (Mutual Information), it is good to have a frequency filter, to reduce the chance of these "random co-occurrences".
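The interaction between the Mutual Information threshold and the minimum frequency for collocates can be sketched in Python. Everything numeric below is invented for illustration (the counts, the frequency of telescope, the corpus size), the scoring uses the textbook pointwise mutual information formula rather than whatever the corpus actually computes, and filter_collocates is a hypothetical name.

```python
import math

CORPUS_SIZE = 1_000_000_000   # assumed corpus size
NODE_FREQ = 10_000            # invented overall frequency of the node word "telescope"

# (collocate, frequency near the node word, overall frequency); all counts invented
collocates = [
    ("space", 900, 1_000_000),
    ("small", 120, 3_000_000),
    ("refracting", 60, 400),
    ("rare-token", 1, 2),     # stands in for a 1-2 token chance co-occurrence
]

def mi(joint, total):
    """Textbook pointwise mutual information in bits (assumed formula)."""
    return math.log2(joint * CORPUS_SIZE / (total * NODE_FREQ))

def filter_collocates(rows, min_mi=3.0, min_freq=5):
    """Default view: sort by frequency, using MI and minimum frequency as filters."""
    kept = [r for r in rows if r[1] >= min_freq and mi(r[1], r[2]) >= min_mi]
    return [word for word, _, _ in sorted(kept, key=lambda r: r[1], reverse=True)]

print(filter_collocates(collocates, min_mi=1.0))  # generally frequent words get through
print(filter_collocates(collocates, min_mi=3.0))  # the suggested starting point
print(filter_collocates(collocates, min_mi=7.0))  # only strongly associated words
print(filter_collocates(collocates, min_mi=3.0, min_freq=1))  # chance co-occurrence returns
```

In this toy data, raising the threshold from 1.0 to 7.0 first removes small and then space, and dropping the minimum frequency to 1 lets the one-token collocate back in despite its very high score.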

Comparing the results in two sections

Whenever you compare something in two sections of a corpus -- such as Academic-Medicine and Academic in COCA, or the 1800s and 1900s in COHA, or US/CA and UK/IE in GloWbE -- the corpus will automatically set Sorting to Relevance, and it will automatically set a Minimum Frequency level for the first section.

For example, suppose we are comparing ADJ in different sections in COCA. Once we select Academic-Medicine for Section 1 and Academic (general) for Section 2 in COCA, the corpus sets Sorting and Frequency as is seen in (1) below.  (In the following examples, you may need to scroll down the entries in Section 1 to see where Academic-Medicine is selected). We could lower the Minimum Frequency to see even more specific words. Or we could increase the Minimum Frequency to see words that -- while still more common in Section 1 -- are less specific to that section. For example, if we increase it to 1000 (i.e. 1000 tokens of the ADJ in Section 1), we would see the results in (2).

1. Default (Min Freq = 20 in COCA)
2. Less specific (Min Freq = 700)

Another way to make sure that the words in Section 1 are not too specific to that section is to require that the words or phrases or collocates occur with at least a certain frequency in Section 2. For example, in (1) above, the adjective doesn't have to occur at all in Section 2. And as a result, we have words like tonsillar, mediastinal, sinonasal, ototoxic that don't occur at all in Section 2 (which is the rest of Academic, not including Academic-Medicine). But suppose we require the adjective to occur at least 20 times outside of the Medicine sub-genre, as shown in (3). (Note that if it occurred 20 times outside of Medicine, it would still be much less common than the 30 tokens in Academic-Medicine, since there are 110 million words of data in Academic outside of Medicine, and "only" 10 million words in Medicine). And so in this case, notice that the list is much "cleaner" and less "Medicine-only" in terms of its vocabulary.

3. Both sections (Min Freq = 20 in Section 2)
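The point about section sizes is easier to see with the numbers normalized per million words. The sketch below uses the section sizes and token counts mentioned above (30 tokens in 10 million words, 20 tokens in 110 million words). The comparison function, its name, and the counts in the word list are assumptions for illustration, and the ratio of per-million frequencies is only one plausible way to rank by relevance.

```python
MEDICINE_SIZE = 10_000_000     # words in Academic-Medicine, as stated above
REST_SIZE = 110_000_000        # words in the rest of Academic, as stated above

def per_million(tokens, section_size):
    """Normalized frequency: tokens per million words of the section."""
    return tokens * 1_000_000 / section_size

medicine_pm = per_million(30, MEDICINE_SIZE)   # 3.0 per million
rest_pm = per_million(20, REST_SIZE)           # about 0.18 per million
print(round(medicine_pm / rest_pm, 1))         # 16.5

# (word, tokens in Section 1, tokens in Section 2); counts are invented
words = [
    ("tonsillar", 45, 0),
    ("clinical", 4000, 2500),
    ("important", 3000, 60000),
]

def compare_sections(rows, min_freq1=20, min_freq2=0):
    """Rank words by how much more frequent they are (per million) in Section 1."""
    ranked = []
    for word, f1, f2 in rows:
        if f1 < min_freq1 or f2 < min_freq2:
            continue
        pm2 = per_million(f2, REST_SIZE)
        ratio = per_million(f1, MEDICINE_SIZE) / pm2 if pm2 > 0 else float("inf")
        ranked.append((word, ratio))
    return [word for word, _ in sorted(ranked, key=lambda r: r[1], reverse=True)]

print(compare_sections(words))                 # Section 1 only: tonsillar comes first
print(compare_sections(words, min_freq2=20))   # tonsillar is filtered out
```

Requiring a minimum frequency in Section 2 removes words that never occur there, which is what makes the list less "Medicine-only".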

Finally, we could decide not to order the frequency by the relative frequency in Section 1 vs Section 2, but rather just show the most frequent ADJ in each section, regardless of how many times it occurs in the other one, as shown in (4).

4. Frequency, not relevance

Again, we can compare almost anything between two sections of the corpus -- individual words (as in these cases with ADJ in COCA), phrases (e.g. ADJ + women in COHA: 1870s-1920s vs 1970s-2000s), collocates (e.g. words near scheme in GloWbE: US vs GB), or even synonyms (e.g. =strong in COCA: fiction vs academic) or words that you've created in a customized word list. The important point is that you have complete control over how specific the words or phrases or collocates are in the different sections of the corpus.

Comparing two words

You can compare the collocates (and therefore the meaning and usage) of two words, such as nouns near utter and complete, or warm / hot, or small / little, or the adjectives near boys and girls or Democrats and Republicans, or the objects of destroy / ruin or sanction / approve. See the Compare Words page for basics on how to do one of these searches. This page deals more with how to set the Sorting and Limits options.

[5] By default, the search will sort by [Relevance], and it will show the collocates that are more frequent with one word than with the other. But if you want to see which words are the most frequent with each of the two words (regardless of the frequency with the other word), then choose [Frequency]. For example, destroy vs ruin: relevance, frequency (notice people, family, economy high on the list for both verbs).
[6,7] Minimum frequency for the collocates of the first word [1]. This can help limit the collocates to just those that really are frequent. For example, profound vs deep: frequency of at least 20, no limits (notice possibly less useful collocates like uses, arrogance, works). The frequency often depends on the overall size of the corpus, such as perhaps 50 in a large corpus like iWeb to maybe 10 in COCA, and perhaps just 3 or 4 in a small corpus like the BNC. If you do specify a minimum frequency [7], make sure that you also check the box [6].
[8,9] Minimum frequency for the collocates of the second word [2]. Again, if you do specify a number, then make sure you also check the box [9]. Whether or not you input a value for [8] (and select [9]) is often an important part of setting up the search. For example, look at ADJ before boys vs girls: no limits on the frequency of the collocates of the second word, lower limit of 2 for the collocates of the second word. In the first case, there are a number of words that can only appear with either boys or girls, perhaps because it is the name of a series of books (Hardy Boys) or a girl band (Indigo Girls), or it is an idiomatic phrase (whipping boys), or it can only apply to one of the two in the real world (pregnant girls). In the second case (limit of 2), all of those entries disappear, because there is no book series called the Hardy Girls, no boy band called Indigo Boys, no idiomatic phrase whipping girls, and there are few if any pregnant boys in the real world. On the other hand, in cases like bizarre / strange, destroy / ruin, or sanction / approve, it is not as important to indicate the minimum frequency for the second word.

Other options for sorting

You can also sort the entries alphabetically, although (to be honest) it's not entirely clear why you would want to. For example, we could search for tr* truck, and we would see results like the following. You will probably also want to have a minimum frequency (we have selected 5 in this case). Because we aren't sorting by something really meaningful, like frequency or relevance, there will probably be a lot of low frequency (1-2 token) words or strings.

One other option for sorting, which does make sense, is to sort by the number of texts that have the word or phrase or collocate. You might want to use this option when you suspect that some of the entries are found in only a small number of texts. For example, a search for President NAME in COHA for the 1820s-1890s gives the results shown below (note that we have also set OPTIONS / SHOW # TEXTS to [YES]). Notice that there are some entries (like President Manning) that occur in only a handful of texts (3 texts, in the case of President Manning). We might not want our results to be skewed by words or phrases that occur in just a few texts.

One option is to set a minimum number of texts (for example, the word or phrase must occur in at least 10 texts). Or we might sort the results by the number of texts that they occur in. In other words, if it really is a frequent word or phrase, then lots of texts should have it. When we sort by the number of different texts that have the matching strings for President NAME, the results seem to be quite good (for texts from the 1800s, of course).
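Both options, a minimum number of texts and sorting by the number of texts, can be sketched in a few lines of Python. Only the figure of 3 texts for President Manning comes from this page. The other names, all token counts, and the function name by_text_count are invented for illustration.

```python
# (phrase, token frequency, number of texts); counts are invented except Manning's 3 texts
rows = [
    ("President Manning", 260, 3),
    ("President Lincoln", 400, 150),
    ("President Grant", 250, 120),
]

def by_text_count(rows, min_texts=None):
    """Sort by the number of texts, optionally requiring a minimum number of texts."""
    kept = [r for r in rows if min_texts is None or r[2] >= min_texts]
    return [phrase for phrase, _, _ in sorted(kept, key=lambda r: r[2], reverse=True)]

by_frequency = [phrase for phrase, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]

print(by_frequency)                      # Manning ranks second on raw frequency
print(by_text_count(rows))               # Manning drops to the bottom
print(by_text_count(rows, min_texts=10)) # Manning is removed entirely
```

A phrase that is frequent only because a few texts repeat it many times ranks high by token frequency but low by text count.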