English-Corpora.org

CORPORA AND AI / LLMs: Overview

Mark Davies / March 2025   

This page is a summary of seven detailed studies that provide extensive data on how well the predictions of two LLMs -- GPT-4o (from OpenAI) and Gemini 1.5 Pro (from Google) -- match the actual data from large, well-known, publicly accessible corpora (like those from English-Corpora.org). You can also see a summary of these comments in a video on YouTube.

The seven detailed studies (with a total of 89 pages of discussion and examples) are found here: word frequency, phrase frequency, collocates, comparing words (via collocates), genre-based variation, historical variation, and dialectal variation. As far as I am aware, no other studies have compared LLMs and corpora at this level of detail.

There are two main questions that I will attempt to answer in this summary:

  • How well do the predictions of the LLMs match actual corpus data, and what might account for the accuracy (or inaccuracy) on specific tasks?

  • What is the role for traditional corpora in the age of LLMs?

In addition, I will end with a couple of notes and suggestions for AI/ML researchers, as well as for corpus linguists.


1. How well do the predictions of the LLMs match actual corpus data?

My original expectations of LLMs were quite low. Although I have been "retired" since 2020, I was a professor of corpus linguistics (the use of large, structured collections of language data for linguistic analysis) for more than 35 years. I have spent much of that time creating what have become the most widely used corpora of English in the world (as well as for Spanish and Portuguese), and I have used this data myself in many studies of variation in language (genre-based, historical, and dialectal variation).

After all of that time and effort, I was expecting (and almost secretly hoping) that the LLMs -- which had never been explicitly trained to generate this type of linguistic data -- would do very poorly. But I was very surprised at how well they in fact did. The "linguistic understanding" of LLMs varies widely according to the specific task. But for some tasks, they were exceptionally good -- arguably equal to or better than the intuitions of even trained linguists. And for certain tasks, they generated (in my opinion) much better data than even the most advanced corpora.

As far as "explaining" why LLMs do better on specific tasks (meaning that their predictions match corpus data well), I should note that because of their very nature, neural networks (which underlie LLMs) are inherently "opaque", and it is often difficult or impossible to know exactly why a model reached a particular conclusion (see Lipton 2018, Bau 2024, Davies and Khakzar 2024). Even experts in AI, who work with LLMs, may disagree on the "mechanics" of what is going on within a model for any particular output. We also need to be very careful with the "introspections" of LLMs themselves, if we ask them how and why they generated a particular output. They are often no better at analyzing their own decisions and processes than humans, who are notoriously bad at "knowing" precisely why they made a particular decision. So we should keep that in mind in the following sections, as we discuss some factors that may have been involved.

1.1  Collocates (word meaning and usage) and comparing words

Let's start with what the LLMs did exceptionally well -- generating collocates. Collocates are "nearby words", which provide insight into the meaning and usage of a word. For example: (nouns) bodice, sprawl, telescope, asthma, kombucha, enzyme; (verbs) evoke, excavate, condense, decode; (adjectives) infrared, rusty, tactile, recyclable; (adverbs) naturally, consciously. As the collocates and "comparing words" (with collocates) pages show, the collocates generated by GPT and Gemini were in most cases much better than the collocates from corpora like Sketch Engine (which is known for its detailed collocates-based "word sketches") and English-Corpora.org. And even in terms of variation across genres, historical periods, and dialects, the data from the LLMs is very accurate and insightful.

How is it that the collocates generated by the LLMs encapsulate so well the meaning and usage of a word? I asked GPT (conversation) and Gemini (conversation) about this, and I found their answers extremely insightful (again, realizing the limitations of their own introspection). ChatGPT suggested the following (and pay particular attention to the highlighted section).

There are several reasons why LLMs tend to excel at analyzing and synthesizing collocates compared to other types of linguistic analysis:

  • Token-based Training on Large Text Corpora
    LLMs are trained on vast amounts of real-world text, which means they encounter billions of word co-occurrences. Since collocates are essentially just word pairings that frequently occur together, LLMs are passively "learning" collocational patterns as they process text.

  • Statistical Association Captured by Embeddings
    Modern LLMs use word embeddings, where words with similar contextual usage cluster together in high-dimensional space. Since collocates frequently appear together, their vector representations naturally reflect these relationships, making it easy for LLMs to identify and synthesize them.

  • Contextual Attention Mechanisms
    Transformer-based models (like GPT-4) use attention mechanisms that allow them to weigh relevant words in context. This means that when predicting or analyzing a word, the model gives higher weight to nearby words that often co-occur with it—effectively mimicking collocational analysis.

  • Latent Semantic Understanding
    LLMs do not just memorize co-occurrences; they also learn semantic relationships. This allows them to generalize patterns of collocation, making inferences about which words are likely to collocate even if they have not explicitly seen them together in training.

  • Ability to Compare and Contrast
    Because LLMs represent words as vectors, they can easily compare the collocational profiles of different words. If two words have distinct collocates, the LLM can infer meaningful contrasts between them—much like a human linguist would when analyzing corpus data.
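
To make the "vectors" point in the last item concrete, here is a minimal Python sketch (the words and co-occurrence counts are invented purely for illustration) that compares the collocational profiles of two words using cosine similarity over simple co-occurrence count vectors -- a crude, low-dimensional stand-in for the embeddings that LLMs actually use:

    import math
    from collections import Counter

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two sparse count vectors."""
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    # Invented collocate counts (collocate -> co-occurrence frequency)
    telescope  = Counter({"space": 40, "lens": 25, "astronomer": 20, "orbit": 15})
    microscope = Counter({"lens": 30, "slide": 22, "laboratory": 18, "cell": 15})
    kombucha   = Counter({"brew": 35, "fermented": 28, "tea": 25, "scoby": 10})

    print(cosine(telescope, microscope))  # fairly high: shared "lens" context
    print(cosine(telescope, kombucha))    # zero: no shared collocates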

Traditional "collocates", as used in corpus linguistics, are simply a "first order level of statistical similarity". There are more advanced and powerful orders of statistical similarity, such as "similar context", "functional role", and so on. The incredibly powerful neural network architecture underlying LLMs provide them with access to all of these levels of statistical similarity, not just the simplistic "surface level" association measures used in corpus linguistics. So it is perhaps no surprise that the collocates from LLMs intuitively seem so good to native speakers. And add to this this fact that the LLMs provide extremely useful summaries of the collocates and what they tell us about the meaning and usage of a word, which is something that no corpora currently offer.

1.2  Word frequency

The predictions of LLMs in terms of word frequency match the actual frequency data from corpora quite well. For example, if we give an LLM words that are at #2,000, #4,000, #8,000, #16,000, and #32,000 in a large frequency list that is based on corpus data and has been carefully corrected, the LLM is very good at ranking these words by frequency.

Why? Again, we probably don't want to read too much into an LLM's "introspection" about its own abilities, but Gemini says the following (emphasis added; pay special attention to the third point, and see also the similar answer from GPT):

  • Massive Textual Dataset: I was trained on a colossal dataset of text and code. This dataset includes a massive amount of written material. By analyzing the frequency of words within this dataset, I can develop a sense of which words are more common than others.

  • Statistical Analysis: During training, I learned to identify patterns and statistical relationships within the text. This includes calculating the frequency of individual words and how often they appear together with other words.

  • Internal Representation: I maintain an internal representation of word frequencies based on my training data. This representation allows me to quickly estimate the relative frequency of different words when prompted.

I have consulted with AI experts, and they say that it is unlikely that LLMs actually have anything like a "lookup table" of word frequencies. But clearly there is something in the models that allows them to rank words in ways that align well with accurate, corpus-based word frequency data.
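
For readers who want to quantify this kind of agreement themselves, here is a hedged sketch of one obvious approach: take the corpus ranks of a handful of test words, ask an LLM to order the same words by frequency, and compute a rank correlation. The words and the "LLM ordering" below are hypothetical, purely for illustration:

    from scipy.stats import spearmanr

    # Hypothetical positions of five test words in a corpus frequency list
    corpus_rank = {"however": 2000, "ambiguous": 4000, "stipulate": 8000,
                   "effervescent": 16000, "proprioception": 32000}

    # A hypothetical LLM response, ordering the words from most to least frequent
    llm_order = ["however", "ambiguous", "stipulate",
                 "proprioception", "effervescent"]

    words = list(corpus_rank)
    llm_rank = {w: i + 1 for i, w in enumerate(llm_order)}
    rho, _ = spearmanr([corpus_rank[w] for w in words],
                       [llm_rank[w] for w in words])
    print(f"Spearman rho = {rho:.2f}")  # 1.00 would be a perfect ordering match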

1.3  Analyzing vs generating words and phrases

As mentioned in Section 1.1, the LLMs are very good at generating collocates for a given word. But aside from collocates, they perform much worse at generating linguistic data than at analyzing it (in terms of how well their predictions match the actual corpus data). For example:

  • They are poor at generating lists of words, such as words with the strings: spri*, *break*, *istic.

  • They are poor at generating lists for lower frequency words, such as in the #30,000-60,000 range of a large word frequency list.

  • They are very poor at generating phrases, such as ADJ industry, dark NOUN, * point *, or he * his *. But crucially, they are not "wrong" in the sense of producing nonsense; their phrases are perfectly sensible. They simply don't match the corpus data very well.

  • They are better at analyzing historical and genre-based lexical variation than at generating such lists (for example, determining which words are more common in one historical period than another, or in one genre than another).

What is fascinating is that when the LLMs are presented with a mixed list of frequency-derived phrases from corpora, along with the original phrases that they themselves had suggested (which typically didn't agree well at all with the corpus data), the LLMs accurately rank the phrases in ways that agree with the corpus data. In other words, the LLM essentially makes a half-hearted attempt at generating good phrases, but it is equally quick to admit that "Well, my list wasn't very good after all; these other phrases are actually better than mine".

Why the difference? Most experts in AI / Machine Learning (ML) know that LLMs are better at categorizing data than at generating data, so it probably shouldn't be a surprise that this is the case in our tests as well. In addition, the "insights" from GPT and Gemini may be of interest as well (again, recognizing the limitations of LLM introspection).

1.4  Genres

Overall, the LLMs do have a fairly good sense of the differences between genres, although the data from the LLMs matches the corpus data much better for certain genres than for others:

  • They seem to have a very good sense of what "academic English" looks like. They can compare academic English to other genres well, and they can also compare well between sub-genres of academic writing (like medicine, law, science, and philosophy).

  • As mentioned above, when generating word lists in different genres, the LLMs can separate academic from fiction very well, academic and newspapers less so, and they struggle with "similar" genres like newspapers and magazines.

  • LLMs have a lot of difficulty separating fiction from genres like spoken and TV and Movies subtitles, since all of these can contain a lot of (informal) dialogue.

Why the difference between genres? To the extent that LLM self-introspection is at all reliable, ChatGPT suggests that this is because "academic texts are structurally consistent and contain domain-specific jargon", whereas such features are less salient in other genres.

1.5 Historical and dialectal

My sense is that LLMs are surprisingly good at comparing historical periods and different countries, in terms of lexis (words), syntax, and semantics (meaning, via collocates). Here are just a few examples:

  • Predicting whether words are increasing or decreasing in frequency over time (including in very informal language)

  • Guessing the first decade in which a word increases significantly in frequency

  • Guessing whether a syntactic construction has increased or decreased over time

  • Predicting whether informal words (like in TV and Movies subtitles) are more frequent in one dialect than another (like US vs UK)

  • Choosing in which of fifteen countries a word is most frequent (probably at a level better than most humans)

  • Guessing whether a syntactic construction is more or less frequent in contrasting dialects

In terms of how LLMs account for genre-based, historical, and dialectal variation, anyone who has used LLMs knows that they can mimic the style of a particular type of poetry (like haikus or limericks) or an author (like Ernest Hemingway or James Joyce), or make something sound "old-fashioned" (like the King James Version of the Bible or Shakespeare), or something very modern and gritty, like hip-hop lyrics. (See somewhat humorous paraphrases of the contents of Section 1.1 above, in each of these styles of text: GPT, Gemini.) So clearly, LLMs can tap into something that provides information about variation.

But there is a wide range of data from the genre, historical, and dialectal papers that indicates that the LLMs are also paying attention to something other than strictly linguistic clues about genre, historical period, and dialect. For example:

  • They are better at predicting the first decade of significant use for "obvious" words (like battleship, spacesuit, supercomputer, selfie) than for "less obvious" words (like corny, upped, lifestyle, caregiver). If the LLMs were somehow getting clues about what decade a text was from (e.g. "the words in the text suggest that it is from the 1930s"), they should do equally well on both sets of words. The fact that they do better with "obvious" words suggests that they are paying attention to statements in books from any decade that say things like "battleships began to be used in the 1890s", and that is how they know that the use of the word battleship became more popular in the 1890s. (A sketch of how this "first significant decade" can be read directly off per-decade corpus data follows this list.)

  • Many of the words that the LLMs generate for the 1960s-1970s (turntable, trolley, thermos, tracksuit) actually aren't more frequent in the 1960s-1970s than in later decades. (This is at least the case for COHA; others might want to systematically examine these words in Google Books n-grams). My guess is that somehow the LLMs have created a "model" of what the 1960s "look like", and they are using that, rather than actual linguistic data.

  • The LLM lists for phrases (smart NOUN, data NOUN) that would have been more frequent in 2010-2014 than in earlier periods bear little resemblance to the actual data from a 20+ billion word corpus for those periods.

  • In terms of dialectal variation, the LLMs are better at "obvious" words (subarctic = CA, inter-caste = IN, apartheid = ZA) than at "less obvious" words (shaggy = JM, leaky (homes) = NZ, draggy = SG). If they were paying attention only to strictly linguistic clues ("shaggy is actually more frequent in texts from Jamaica"), that distinction shouldn't exist.
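
As promised above, here is a hedged sketch of how the "first decade of significant use" can be read directly off per-decade corpus frequencies, with no world knowledge involved. The frequencies below are invented; real figures would come from a corpus like COHA:

    def first_significant_decade(freq_by_decade, share_of_peak=0.10):
        """First decade whose per-million frequency reaches a share of the peak."""
        peak = max(freq_by_decade.values())
        for decade in sorted(freq_by_decade):
            if freq_by_decade[decade] >= share_of_peak * peak:
                return decade
        return None

    # Invented per-million frequencies for "battleship", by decade
    battleship = {1870: 0.0, 1880: 0.2, 1890: 3.5,
                  1900: 6.1, 1910: 8.4, 1940: 9.0}
    print(first_significant_decade(battleship))  # -> 1890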

And how can we account for the ability of the LLMs to know so much about variation in syntax -- between genres (e.g. get passive, END up V-ing), over time (e.g. they have no time -> they don't have time), and across dialects (e.g. STOP PRON (from) V-ing)? In many cases, the LLMs map out the variation even better than many linguists could. And yet there are times when they appear simply to be "parroting" information from published sources like grammar books, style guides, and textbooks, such as with the dialectal variation of try and VERB (vs try to VERB) or genre-based variation with accustomed to VERB/VERB-ing.

Finally, there is clear evidence that in some cases the LLMs definitely are just "regurgitating" information from other sources, such as web pages that explicitly discuss the use of a word or a syntactic construction. To give just one example, GPT provides links to an online dictionary and (interestingly) a page from English-Corpora.org when discussing the meaning of cupboard in the UK and the US. So in a case like this, did the LLM already "know" the difference between the two dialects, and only then go in search of collocates to support that view? And more generally, how can we distinguish between novel linguistic analyses based on the trillions of words of data that underlie these models, and the simple "parroting" of what others have said about a word or a construction?


2. What is the role for traditional corpora in the age of Large Language Models?

As we have seen, for certain tasks LLMs provide quality linguistic data that may surprise some researchers, teachers, and learners. For example, the collocates generated by LLMs may be more insightful and helpful than actual corpus-based collocates. In addition, they provide explanations of what these collocates tell us about the meaning and usage of a word, and these summaries of collocates and explanations are even more useful when comparing words.

But I would argue that corpora still (as of early 2025) have important advantages.

2.1 Actual frequency data and ability to verify the data

The LLMs predict that one word or phrase might be more frequent than another, or that a certain syntactic construction has a certain distribution across genres, time periods, or dialects. We essentially "have to take the LLM's word for it"; they rarely provide proof for these assertions. But with corpora, we can see the actual number of tokens, we can see the actual KWIC (Keyword in Context) entries and their sources, and we can verify that the data is correct. And with the corpora from English-Corpora.org, you can provide a link to your searches (for use in a research paper or for student use in the classroom), so that others see exactly what you saw when you did the search (more: PDF, video). In addition to using the corpus interface at English-Corpora.org, you can also download the actual corpus data, as well as derived word frequency, collocates, and n-grams data. There should never be any question about the reliability and verifiability (and potential falsifiability) of the corpus data.

2.2 Consistent data

The LLMs may provide wildly different "data" for the same prompt on different occasions. For example, compare the predictions from the LLMs on two separate occasions: GPT (#1, #2) and Gemini (#1, #2), when they were asked to generate the 20 most frequent words for each of the three strings spri*, *break*, *istic. The identical prompts yielded quite different results on the two days, which were just a couple of weeks apart. For example, on the first day GPT suggested the following spri* words: spring, springs, sprint, sprinting, sprinter, springtime, sprints, sprightly, springing, sprig, sprigs, springy, sprinters, sprinted, spritz, spritzed, spritzes, spritzing, spritely, sprinter's. The second day it suggested: spring, sprinkle, sprint, springtime, springing, sprinkled, sprinter, sprinkling, springboard, springer, sprightly, sprite, sprigs, springy, sprig, sprinkler, sprinting, sprinted, spright, spritz. Only 12 of the 20 words are the same between the two days, and I got similar results for the other two searches in GPT and all three searches in Gemini (see data). With a corpus (assuming that the corpus is not changing from day to day), you will get the same results for the same search.
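
The overlap figure is easy to verify. A minimal Python check, using the two GPT lists quoted above:

    day1 = {"spring", "springs", "sprint", "sprinting", "sprinter", "springtime",
            "sprints", "sprightly", "springing", "sprig", "sprigs", "springy",
            "sprinters", "sprinted", "spritz", "spritzed", "spritzes",
            "spritzing", "spritely", "sprinter's"}
    day2 = {"spring", "sprinkle", "sprint", "springtime", "springing",
            "sprinkled", "sprinter", "sprinkling", "springboard", "springer",
            "sprightly", "sprite", "sprigs", "springy", "sprig", "sprinkler",
            "sprinting", "sprinted", "spright", "spritz"}
    shared = day1 & day2
    print(len(shared))      # 12 words appear in both lists
    print(sorted(shared))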

2.3 No worries about "hallucinating"

As we have seen, the LLMs are generally accurate when it comes to linguistic variation. But we have also seen that they sometimes just "make up" data that neither seems reasonable nor can be verified with corpus data. The corpus data can always be verified, especially when (as with English-Corpora.org) you can create links to your searches, so that others can see and examine exactly what you did when you did the search.

2.4 No worries about "regurgitating" (incorrect) information from other sources

There are clear examples where the LLM is simply repeating something that it saw in a dictionary, a grammar book, or a web page, but where that information is wrong (such as in the last paragraph of Section 1.5 above). Sometimes these other sources are right and sometimes they are wrong, but the LLM has trouble knowing which is which. With actual corpus data, we can verify and potentially disprove the claims that are made by others.

2.5 Ability to carry out more narrow / advanced searches

Most of the syntactic searches that I carried out in the genres, historical, and dialectal pages were fairly broad -- such as the get passive, the be passive, the like construction, and so on. The one time I did get really "narrow and deep" (to VERB vs VERB-ing with start and also accustomed), I started getting simplistic "grammar book" answers that didn't agree well with the corpus data. And sometimes we do want to get more narrow than the overall construction. For example, in the case of negative placement with have (they have no time -> they don't have time), we might want to also look at genre, to see if the pre-verbal placement (they don't have time) spread from informal genres to more formal ones. Or with the STOP PRON (from) VERB-ing construction (they stopped him (from) leaving), maybe the variation depends on how long the intervening noun phrase is -- stopped { him / his friend / all of the other people in the group } (from) leaving. Can we trust the LLMs to provide us with quality, testable data on something that narrow?
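
With a corpus, even a crude script over raw, untagged text can begin to answer a question that narrow. Here is a hedged sketch (plain regular expressions, so far less precise than a query against a POS-tagged corpus) that finds STOP ... (from) V-ing and buckets the hits by the length of the intervening noun phrase; the sample sentences are invented:

    import re
    from collections import Counter

    PATTERN = re.compile(
        r"\bstop(?:s|ped|ping)?\s+"   # STOP in any form
        r"((?:\w+\s+){1,8}?)"         # intervening NP of 1-8 tokens (non-greedy)
        r"(from\s+)?"                 # optional "from"
        r"(\w+ing)\b", re.IGNORECASE)

    text = ("They stopped him leaving. "
            "She stopped her best friend from leaving. "
            "Nothing stopped all of the other people in the group from leaving.")

    hits = Counter()
    for np, from_, verb in PATTERN.findall(text):
        hits[(len(np.split()), "from" if from_ else "no from")] += 1
    print(hits)  # (NP length in tokens, variant) -> count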

2.6 Generating word and phrase data

We saw in Section 1.3 above that the LLMs have real difficulty in generating word and phrase data that agrees with the corpus data (although they are much better at categorizing and analyzing such data). With a corpus, we can easily run a query to see word frequency (e.g. spri* or *istic or un*able) or phrase frequency (e.g. ADJ industry, dark NOUN, he * his *, to * the *). And then we can immediately check the data -- for example whether it is limited to just a few texts or whether it is spread evenly across the corpus. We just can't trust or check the LLM data in that way.
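
With a downloaded frequency list, such wildcard queries are also trivial to reproduce offline. A hedged sketch (the tiny frequency table is an invented stand-in for real downloaded data; fnmatch handles the same * wildcard syntax used above):

    from fnmatch import fnmatch

    # A tiny invented excerpt from a word frequency list (word -> frequency)
    freq = {"spring": 95000, "sprint": 8200, "sprinkle": 2100,
            "unable": 88000, "unstable": 15000, "unbreakable": 900,
            "artistic": 30000, "characteristic": 27000, "holistic": 6000}

    def wildcard_query(pattern):
        """All words matching the wildcard pattern, sorted by frequency."""
        hits = {w: f for w, f in freq.items() if fnmatch(w, pattern)}
        return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)

    for pat in ("spri*", "*istic", "un*able"):
        print(pat, "->", wildcard_query(pat))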

2.7 KWIC lines

One of the "bread and butter" tools of corpora is the ability to see Keyword in Context (KWIC) lines for any word or phrase, and then sort the lines to find interesting patterns in which the word or phrase occurs. Similar searches in GPT and Gemini are much more "clunky" than with the corpora, and of course the KWIC lines are just made up by the LLM. They don't represent actual phrases from the language, as they do with corpora.

2.8 Integrated / linked data and searches

Via the LLM interface, you enter a question at the prompt and then you see the answer, with few or no links to other related information. But one of the main strengths of a well-planned, user-oriented corpus interface is the ability to move from one word or one display to another, and thus follow a "chain" of related words and phrases. To give just a few (of many possible) examples from English-Corpora.org, users can:

  • See detailed information on each of the top 60,000 words (lemmas) in the corpora, including: 1) detailed frequency information, including frequency by genre; 2) definitions, including links to external definitions and etymology; 3) links to images; 4) links to pronunciations and external videos; 5) links to translations from several external websites; 6) synonyms; 7) more specific and more general words from WordNet; 8) other words whose definitions use the word; 9) morphologically-related words; 10) collocates; 11) related “topics” -- co-occurring words found anywhere in the text, not just within 3-4 words; 12) the most common 2-, 3-, and 4-word clusters; 13) texts or websites where the word is very frequent, which can be used to create “Virtual Corpora” for the word; and 14) re-sortable concordance (KWIC) lines (more: PDF, video). Many of those links (e.g. topics, collocates, clusters, or KWIC entries) lead to extremely detailed pages that also link to other words and phrases.

  • Do a search for words, phrases, or collocates, and then find detailed information (see above) on any word in the results

  • See Keyword in Context (KWIC) lines for a word, phrase, or construction, and then get detailed information on any of the words (see above) in the KWIC display (more: PDF, video)

  • Browse through the top 60,000 words in the corpus (by word form, frequency, definition, and even pronunciation), and then see detailed information (see above) on any word in the list (more: PDF, video)

  • Enter an entire text (for example the contents of an online newspaper or magazine article), and then see related phrases for any phrase in the text, or detailed information (see above) on any word in the text (more: PDF, video)

  • And all of this information includes links to other words and phrases. So you could move from one word to another to another in the collocates lists, and thereby explore a "forest" of related words


Notes for AI/ML researchers

With such powerful LLMs at their disposal, AI and ML (machine learning) researchers might feel that corpora and corpus linguistics have little if anything to offer, and that corpora are a "quaint throwback" to the 1990s or before. However, robust, well-developed corpora may still have a role to play in LLM development, especially in terms of the issues raised in Sections 2.1-2.7 above. Perhaps most importantly, corpus data is actually verifiable (see 2.1 above), and you can download the actual corpus data, as well as derived word frequency, collocates, and n-grams data. The corpus data may also be helpful for tool use and Retrieval Augmented Generation (RAG). For a short (and hopefully not too simplistic) overview, you might take a look at these quick summaries from GPT and Gemini about how corpora can inform LLMs. Because I am not an AI/ML researcher, I would be interested in other areas (fine-tuning? distillation?) where you think that actual corpus data might be helpful.
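
As one concrete (and heavily hedged) illustration of the tool-use / RAG idea: a tiny lookup function over downloaded corpus frequency data that an LLM agent could call instead of guessing. The file name and tab-separated format are assumptions for illustration, not a real English-Corpora.org schema:

    import csv

    def load_frequency_table(path):
        """Load a downloaded frequency list (assumed format: word<TAB>frequency)."""
        with open(path, newline="", encoding="utf-8") as f:
            return {row[0]: int(row[1]) for row in csv.reader(f, delimiter="\t")}

    def frequency_tool(word, table):
        """A function an LLM could call to ground its frequency claims in data."""
        freq = table.get(word.lower())
        return {"word": word, "frequency": freq, "attested": freq is not None}

    # Hypothetical usage, assuming a downloaded file "word_freq.tsv":
    # table = load_frequency_table("word_freq.tsv")
    # print(frequency_tool("kombucha", table))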

Notes re corpus linguistics

Corpus linguistics is a field where tradition is important. Many researchers are still using the same corpora that they used 20-30 years ago. In addition, it is almost "taboo" to use corpora and datasets that are not carefully, lovingly crafted according to exacting standards, and even in the 2020s, great care is taken to create new corpora that look just like corpora from the early 1990s. It is therefore unlikely that many corpus linguists will seriously consider how they can incorporate insights from LLMs into their research; it's just too far outside of their comfort zone. But I would argue that this may lead others to view their field and many of their corpora as increasingly irrelevant.