Mark Davies / March 2025

This page is a summary of seven detailed studies that provide extensive data on how well the
predictions of two LLMs -- GPT-4o (from OpenAI) and Gemini 1.5 Pro (from
Google) -- match the actual data from large, well-known,
publicly-accessible corpora (like those from English-Corpora.org).
You can also see a summary of these comments in a
video on YouTube.
The seven detailed studies (with a total of 89
pages of discussion and examples) are found here:
word frequency,
phrase frequency,
collocates,
comparing words (via collocates),
genre-based variation,
historical variation, and dialectal
variation. As far as I am aware, no other studies have compared LLMs and
corpora at this level of detail.
There are two main questions that I
will attempt to answer in this summary:
- How well do the predictions of the LLMs match actual corpus data, and what might account for the accuracy (or inaccuracy) with the specific tasks?
- What is the role for traditional corpora in the age of LLMs?
In addition, I will end with a
couple of notes and suggestions for AI/ML researchers, as well
as for corpus linguists.
1. How well do the predictions of
the LLMs match actual corpus data?
My original expectations of LLMs were
quite low. Although I have been "retired" since 2020, I was a
professor of corpus linguistics (the use of large collections
of structured language data for linguistic analysis) for more than 35
years. I have spent much of that time creating what have become the
most widely used corpora
of English throughout the world (as well as for
Spanish /
Portuguese), and I
have used this data myself for
many studies
on variation in language (genre-based, historical, and dialectal
variation).
After all of that time and effort, I was expecting (and
almost secretly hoping) that the LLMs -- which had never been
explicitly trained to generate this type of linguistic data -- would do very
poorly. But I was very surprised at how well they in fact did. The "linguistic
understanding" of LLMs varies widely according to the specific task.
But for some tasks, they were exceptionally good -- arguably equal
to or better than the intuitions of even trained linguists. And for certain tasks,
they generated (in my opinion) much better data than even the most
advanced corpora.
As far as "explaining" why LLMs do
better on specific tasks (meaning that their predictions match
corpus data well), I should note that because of their very nature,
neural networks (which underlie LLMs) are inherently "opaque", and
it is often difficult or impossible to know exactly why a model
reached a particular conclusion (see
Lipton
2018, Bau
2024, Davies
and Khakzar 2024). Even AI experts who work with LLMs
may disagree on the "mechanics" of what is going on within a model
for any particular output. We also need to be very careful with the
"introspections" of LLMs themselves, if we ask them how and why they
generated a particular output. They are often no better at analyzing
their own decisions and processes than humans, who are notoriously
bad at "knowing" precisely why they made a particular decision. So
we should keep that in mind in the following sections, as we discuss
some factors that may have been involved.
1.1 Collocates (word meaning and
usage) and comparing words
Let's start with what the LLMs did
exceptionally well -- generating collocates. Collocates are
"nearby words", which provide insight into the meaning and usage of
a word. For example: (nouns) bodice, sprawl, telescope, asthma, kombucha, enzyme; (verbs) evoke, excavate, condense, decode; (adjectives) infrared, rusty, tactile, recyclable; (adverbs) naturally, consciously.
As the collocates and "comparing words" (with collocates) pages show,
the collocates generated by GPT and Gemini were in most cases much
better than the collocates from corpora like Sketch Engine (which is
known for its detailed collocates-based "word sketches") and
English-Corpora.org. And even in terms of variation across
genres,
historical periods, and
dialects, the data from the LLMs is very accurate and
insightful.
How is it that the collocates generated by the
LLMs encapsulate so well the meaning
and usage of a word? I asked GPT (conversation)
and Gemini (conversation)
about this, and I found their answers extremely insightful (again, realizing the
limitations of their own introspection). ChatGPT suggested the following (and
pay particular attention to the highlighted section).
There are several reasons why
LLMs tend to excel at analyzing and synthesizing collocates
compared to other types of linguistic analysis:
- Token-based Training on Large Text Corpora: LLMs are trained on vast amounts of real-world text, which means they encounter billions of word co-occurrences. Since collocates are essentially just word pairings that frequently occur together, LLMs are passively "learning" collocational patterns as they process text.
- Statistical Association Captured by Embeddings: Modern LLMs use word embeddings, where words with similar contextual usage cluster together in high-dimensional space. Since collocates frequently appear together, their vector representations naturally reflect these relationships, making it easy for LLMs to identify and synthesize them.
- Contextual Attention Mechanisms: Transformer-based models (like GPT-4) use attention mechanisms that allow them to weigh relevant words in context. This means that when predicting or analyzing a word, the model gives higher weight to nearby words that often co-occur with it -- effectively mimicking collocational analysis.
- Latent Semantic Understanding: LLMs do not just memorize co-occurrences; they also learn semantic relationships. This allows them to generalize patterns of collocation, making inferences about which words are likely to collocate even if they have not explicitly seen them together in training.
- Ability to Compare and Contrast: Because LLMs represent words as vectors, they can easily compare the collocational profiles of different words. If two words have distinct collocates, the LLM can infer meaningful contrasts between them -- much like a human linguist would when analyzing corpus data.
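The embedding point in the list above can be made concrete with a toy sketch: invented low-dimensional vectors and cosine similarity, standing in for the high-dimensional embeddings a real model learns. All vectors and words here are made up for illustration.

```python
import math

# Invented 3-D "embeddings" -- real models use hundreds of dimensions,
# learned from text; these values are chosen only to make the point.
embeddings = {
    "telescope": [0.9, 0.1, 0.2],
    "astronomer": [0.8, 0.2, 0.3],
    "observatory": [0.85, 0.15, 0.25],
    "kombucha": [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word, k=2):
    """The k words whose vectors are closest to `word`."""
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: cosine(embeddings[word], embeddings[w]),
                  reverse=True)[:k]

# Words that share contexts cluster together; unrelated words do not.
print(nearest("telescope"))  # astronomy words rank above kombucha
```

Because collocates of a word tend to share contexts with it, this kind of vector neighborhood ends up encoding collocational information "for free".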
Traditional "collocates", as used in corpus linguistics, are simply
a "first order level of statistical similarity". There are more
advanced and powerful orders of statistical similarity, such as
"similar context", "functional role", and so on. The incredibly
powerful neural network architecture underlying LLMs provides them
with access to all of these levels of statistical similarity, not
just the simplistic "surface level"
association measures used in corpus linguistics. So it is
perhaps no surprise that the collocates from LLMs intuitively seem
so good to native speakers. Add to this the fact that the LLMs
provide extremely useful summaries of the collocates and what they
tell us about the meaning and usage of a word, which is something
that no corpora currently offer.
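For comparison, a "surface level" association measure of the kind used in corpus linguistics can be sketched in a few lines: a toy pointwise mutual information (PMI) score over a fixed co-occurrence window. The mini corpus is invented for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
corpus = ("the rusty nail sat in the rusty bucket while "
          "the shiny new nail sat in the drawer").split()

WINDOW = 2  # count collocates within +/- 2 words
word_freq = Counter(corpus)
pair_freq = Counter()
for i, w in enumerate(corpus):
    lo, hi = max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)
    for j in range(lo, hi):
        if j != i:
            pair_freq[(w, corpus[j])] += 1

N = len(corpus)

def pmi(node, collocate):
    """Rough PMI: log2 of observed co-occurrence vs chance."""
    pair = pair_freq[(node, collocate)]
    if pair == 0:
        return float("-inf")
    return math.log2(pair * N / (word_freq[node] * word_freq[collocate]))

# "rusty" is more strongly associated with "nail" than "the" is,
# even though "the" occurs near "nail" too.
print(pmi("rusty", "nail"), pmi("the", "nail"))
```

A measure like this only sees raw co-occurrence counts; it has no notion of similar contexts or functional roles, which is exactly the limitation the paragraph above describes.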
1.2 Word frequency
The predictions of LLMs in terms of
word frequency match the actual frequency data from corpora quite
well. For example, if we give an LLM words that are at #2,000, #4,000, #8,000,
#16,000, and #32,000 in a
large frequency list that is based on corpus data and has been
carefully corrected, the LLM is very good at ranking these words
by frequency.
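One way to quantify "very good at ranking" is a rank correlation between the LLM's ordering and the corpus ordering. A minimal sketch, with a hypothetical word list and a hypothetical LLM ranking that swaps one adjacent pair:

```python
# Hypothetical data: corpus order is most -> least frequent;
# the "LLM" ranking swaps one adjacent pair.
corpus_order = ["house", "mortgage", "foreclose", "escrow", "amortize"]
llm_order    = ["house", "mortgage", "escrow", "foreclose", "amortize"]

def spearman(order_a, order_b):
    """Spearman's rho for two rankings of the same items (no ties)."""
    n = len(order_a)
    rank_b = {w: i for i, w in enumerate(order_b)}
    d2 = sum((i - rank_b[w]) ** 2 for i, w in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman(corpus_order, llm_order))  # 0.9
```

A rho near 1.0 means the LLM's ordering closely tracks the corpus ordering; a single adjacent swap on five items still yields 0.9.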
Why? Again, we probably don't want to read too
much into an LLM's "introspection" about its own abilities, but
Gemini says the
following (emphasis added. Pay special attention to the third point,
and see also the
similar answer from GPT):
- Massive Textual Dataset: I was trained on a colossal dataset of text and code. This dataset includes a massive amount of written material. By analyzing the frequency of words within this dataset, I can develop a sense of which words are more common than others.
- Statistical Analysis: During training, I learned to identify patterns and statistical relationships within the text. This includes calculating the frequency of individual words and how often they appear together with other words.
- Internal Representation: I maintain an internal representation of word frequencies based on my training data. This representation allows me to quickly estimate the relative frequency of different words when prompted.
I have consulted with AI experts,
and they say that it is unlikely that LLMs actually have anything
like a "lookup table" of word frequencies. But clearly there is
something in the models that allows them to
accurately rank words in ways that
align with accurate
word frequency data.
1.3 Analyzing vs generating
words and phrases
As mentioned in #1, the LLMs are very
good at generating collocates for a given word. But aside from
collocates, they perform much worse at generating linguistic
data than at analyzing data (in terms of how well their
predictions match the actual corpus data). For example:
- They are poor at generating lists of words, such as words with the strings: spri*, *break*, *istic.
- They are poor at generating lists for lower frequency words, such as in the #30,000-60,000 range of a large word frequency list.
- They are very poor at generating phrases, such as ADJ industry, dark NOUN, * point *, or he * his *. But crucially, they are not "wrong" in the sense that they don't make sense. They simply don't match the corpus data very well.
- They are better at analyzing historical and genre-based lexical variation than at generating such lists (for example, what words are more common in one historical period than another, or in one genre more than another).
What is fascinating is that when the
LLMs are presented with a mixed list of frequency-derived phrases
from corpora, as well as the original phrases that they themselves
had suggested (which typically didn't agree well at all with the
corpus data), the LLMs accurately rank the phrases
in ways that agree with the corpus data. In other words, the LLM is
essentially making a half-hearted attempt at generating good
phrases, but it is equally quick to admit that "Well, my list wasn't
very good after all; these other phrases are actually better than
mine".
Why the difference? Most experts in AI / Machine
Learning (ML) know that LLMs
are better at categorizing data than generating data, so it probably
shouldn't be a surprise that this is the case in our tests as well. In addition,
the "insights" from
GPT and Gemini
may be of interest as well (again, recognizing the limitation of LLM
introspection.)
1.4 Genres
Overall, the LLMs do have a fairly good
sense of differences between genres, although the data from the LLM
matches the corpus data much better for certain genres than for
others:
- They seem to have a very good sense of what "academic English" looks like. They can compare academic English well to other genres, and they can compare well between sub-genres of academic (like medicine, law, science, philosophy, etc.).
- As mentioned above, when generating word lists in different genres, the LLMs can separate academic from fiction very well, academic and newspapers less so, and they struggle with "similar" genres like newspapers and magazines.
- LLMs have a lot of difficulty separating fiction and genres like spoken and TV and Movies subtitles, since both of these can contain a lot of (informal) dialogue.
Why the difference between genres?
To the extent that LLM self-introspection is at all reliable,
ChatGPT suggests that this is because "academic texts are
structurally consistent and contain
domain-specific jargon", whereas this is less salient
for other genres.
1.5 Historical and
dialectal
My sense is that LLMs are surprisingly good at comparing
historical periods and
different countries, in terms of lexis (words), syntax, and
semantics (meaning, via collocates). Here are just a few examples:
- Predicting whether words are increasing or decreasing in frequency over time (including in very informal language)
- Guessing the first decade in which a word increases significantly in frequency
- Guessing whether a syntactic construction has increased or decreased over time
- Predicting whether informal words (like in TV and Movies subtitles) are more frequent in one dialect than another (like US vs UK)
- Choosing in which of fifteen countries a word is most frequent (probably at a level better than most humans)
- Guessing whether a syntactic construction is more or less frequent in contrasting dialects
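The corpus side of the "first decade of significant increase" task can be sketched as a simple scan over per-decade frequencies. The trajectory below is invented, loosely modeled on a word like battleship; the threshold values are arbitrary choices.

```python
def first_significant_decade(freq_by_decade, factor=3.0, floor=1.0):
    """First decade whose per-million frequency is at least `floor`
    and at least `factor` times the previous decade's frequency."""
    decades = sorted(freq_by_decade)
    for prev, cur in zip(decades, decades[1:]):
        before, after = freq_by_decade[prev], freq_by_decade[cur]
        if after >= floor and after >= factor * max(before, 0.01):
            return cur
    return None

# Hypothetical per-million-word frequencies, invented for illustration
series = {1860: 0.0, 1870: 0.1, 1880: 0.2, 1890: 2.5, 1900: 4.0, 1910: 3.8}
print(first_significant_decade(series))  # 1890
```

The point is that the corpus answer is mechanically derivable and checkable, whereas the LLM's answer to the same question has to be taken on trust.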
In terms of how LLMs account for
genre-based, historical, and dialect variation, anyone who has used
LLMs knows that they can mimic the style of a particular type of
poetry (like haikus or limericks) or an author (like Ernest
Hemingway or James Joyce), or make something sound "old fashioned"
(like the King James Version of the Bible or Shakespeare), or
something very modern and gritty like hip-hop lyrics. (See somewhat
humorous paraphrases of the contents of Section 1.1 above, in each
of these styles of text:
GPT,
Gemini.) So
clearly, LLMs can tap into something that provides information about
variation.
But there is a wide range of data from the
genre, historical, and dialectal papers that indicates that the LLMs are also
paying attention to something other than strictly linguistic clues about
genre, historical period, and dialect. For example:
- They are better at predicting the first decade of significant use for "obvious" words (like battleship, spacesuit, supercomputer, selfie) than "less obvious" words (like corny, upped, lifestyle, caregiver). If the LLMs were somehow getting clues about what decade a text was from (e.g. "the words in the text suggest that it is from the 1930s"), then they should do equally well on both sets of words. But the fact that they do better with "obvious" words suggests that they are paying attention to statements from books from any decade that say things like "battleships began to be used in the 1890s", and that is how they know that the use of the word battleship became more popular in the 1890s.
- Many of the words that the LLMs generate for the 1960s-1970s (turntable, trolley, thermos, tracksuit) actually aren't more frequent in the 1960s-1970s than in later decades. (This is at least the case for COHA; others might want to systematically examine these words in Google Books n-grams). My guess is that somehow the LLMs have created a "model" of what the 1960s "look like", and they are using that, rather than actual linguistic data.
- The LLM lists for phrases (smart NOUN, data NOUN) that would have been more frequent in 2010-2014 than in earlier periods bear little resemblance to the actual data from a 20+ billion word corpus for those periods.
- In terms of dialectal variation, the LLMs are better at "obvious" words (subarctic = CA, inter-caste = IN, apartheid = ZA) than "less obvious" words (shaggy = JM, leaky (homes) = NZ, draggy = SG). If they were paying attention only to strictly linguistic clues ("shaggy is actually more frequent in texts from Jamaica"), that distinction shouldn't exist.
And how can we account for the ability of the
LLMs to know so much about variation in syntax -- between
genres (e.g. get passive, END up V-ing),
over time (e.g. they have no
time -> they don't have time), and across
dialects (e.g. STOP PRON (from) V-ing)?
In many cases, the LLMs map out the variation even better than many linguists
could do. And yet there are times when they appear to simply be "parroting"
information from published sources like grammar books, style guides, and
textbooks, such as with the dialectal variation of
try and VERB (vs try to VERB) or
genre-based variation with accustomed to
VERB/VERB-ing.
Finally, there is clear evidence that in some
cases the LLMs definitely are just "regurgitating" information from other
sources, such as web pages that explicitly discuss the use of a word or a
syntactic construction. To give just one example, GPT provides links to an
online dictionary and (interestingly)
a page from
English-Corpora.org when discussing the
meaning of cupboard in the UK and the US. So in a case like this,
did the LLM already "know" the difference between the two dialects, and
only
then did it go in search of collocates to support that view? And more generally,
how can we distinguish between novel linguistic analyses based on actual
data from the trillions of words of data that underlie their models, and the
simple "parroting"
of what others have said about a word or a construction?
2. What is the role for
traditional corpora in the age of Large Language Models?
As we have seen, for certain tasks LLMs provide
quality linguistic data that may surprise some researchers, teachers, and
learners. For example, the collocates
generated by LLMs may be more insightful and helpful than actual corpus-based
collocates. In addition, they provide explanations of what these collocates tell
us about the meaning and usage of a word, and these summaries of collocates and
explanations are even more useful when comparing
words.
But I would argue that there are still (as of
early 2025) important advantages with corpora.
2.1 Actual frequency data and ability to
verify the data
The LLMs predict that one word or phrase might
be more frequent than another, or that a certain syntactic construction may have
a certain distribution across genres, time periods, or dialects. We essentially
"have to take the LLM's word for it". They rarely provide proof for these
assertions. But with corpora, we can see the actual number of tokens, and we can
see the actual KWIC (Keyword in Context) entries and their source, and verify
that the data is correct. And with the corpora from English-Corpora.org, you can
actually provide a link to your searches (for use in a research paper or for
student use in the classroom), so that others see exactly what you saw when you
did the search (more:
PDF,
video). And in addition to using the corpus interface via English-Corpora.org,
you can also download the actual corpus
data, as well as derived word
frequency, collocates, and
ngrams data. There should never be any question about the reliability and
verifiability (and potential falsifiability) of the corpus data.
2.2 Consistent data
The LLMs may provide wildly different "data" for
the same prompt, on different occasions. For example, compare the predictions
from the LLMs on two different occasions: GPT (#1,
#2)
and Gemini (#1,
#2), when they were
asked to generate the 20 most frequent words for each of the three strings
spri*, *break*, *istic. The identical prompts yielded quite different
results on the two separate days, which were just a couple of weeks apart. For example, on the first day GPT suggested the
following spri* words: spring, springs, sprint, sprinting, sprinter,
springtime, sprints, sprightly, springing, sprig, sprigs, springy, sprinters,
sprinted, spritz, spritzed, spritzes, spritzing, spritely, sprinter's. The
second day it suggested: spring, sprinkle, sprint, springtime, springing,
sprinkled, sprinter, sprinkling, springboard, springer, sprightly, sprite,
sprigs, springy, sprig, sprinkler, sprinting, sprinted, spright, spritz.
Only 12 of the 20 words are the same between the two days, and I got similar
results for the other two searches in GPT and all three searches in Gemini (see
data). With a corpus (assuming that the corpus is not changing from day
to day), you will get the same results for the same search.
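The day-to-day inconsistency can be quantified directly from the two spri* lists above:

```python
# The two spri* lists GPT produced on two different days, as quoted above
day1 = {"spring", "springs", "sprint", "sprinting", "sprinter",
        "springtime", "sprints", "sprightly", "springing", "sprig",
        "sprigs", "springy", "sprinters", "sprinted", "spritz",
        "spritzed", "spritzes", "spritzing", "spritely", "sprinter's"}
day2 = {"spring", "sprinkle", "sprint", "springtime", "springing",
        "sprinkled", "sprinter", "sprinkling", "springboard", "springer",
        "sprightly", "sprite", "sprigs", "springy", "sprig", "sprinkler",
        "sprinting", "sprinted", "spright", "spritz"}

overlap = day1 & day2
print(len(overlap))                      # 12 of 20 words shared
print(len(overlap) / len(day1 | day2))   # Jaccard similarity
```

A Jaccard similarity well below 1.0 for identical prompts is exactly the instability that a static corpus does not have.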
2.3 No worries about "hallucinating"
As we have seen, the LLMs are generally accurate
when it comes to linguistic variation. But we have also seen that they sometimes
just "make up" data that neither seems reasonable nor can be verified with
corpus data. The corpus data can always be verified, especially when (as with
English-Corpora.org), you can create links to your searches, so that others can
see and examine exactly what you did when you did the search.
2.4 No worries about "regurgitating"
(incorrect) information from other sources
There are clear examples of where the LLM is
simply repeating something that it saw in a dictionary, a grammar book, or a web
page, but where that information is wrong (such as in the last paragraph of the
previous section). Sometimes these other sources are right and sometimes they
are wrong, but the LLM has trouble knowing which is which. With actual corpus
data, we can verify and potentially disprove the claims that are made by others.
2.5 Ability to carry out more narrow /
advanced searches
Most of the syntactic searches that I carried
out in the genres,
historical, and dialectal pages were
fairly broad -- such as the get passive, the be passive, the
like construction, and so on. The one time I did get really "narrow and
deep" (to VERB vs VERB-ing with start
and also accustomed), I started getting simplistic "grammar book"
answers that didn't agree well with the corpus data. And sometimes we do want to
get more narrow than the overall construction. For example, in the case of
negative placement with have
(they have no time -> they don't
have time), we might want to also look at genre, to see if the
pre-verbal placement (they don't have time) spread
from informal genres to more formal ones. Or with the
STOP PRON (from) VERB-ing construction
(they stopped him (from) leaving), maybe the
variation depends on how long the intervening noun phrase is -- stopped {
him / his friend / all of the other people in the group } (from)
leaving. Can we trust the LLMs to provide us with quality, testable data on
something that narrow?
2.6 Generating word and phrase data
We saw in Section 1.3 above that the LLMs have
real difficulty in generating word and phrase data that agrees with the
corpus data (although they are much better at categorizing and analyzing such
data). With a corpus, we can easily run a query to see word frequency (e.g.
spri*
or *istic
or
un*able)
or phrase frequency (e.g.
ADJ industry,
dark NOUN,
he * his
*,
to * the *).
And then we can immediately check the data -- for example whether it is limited
to just a few texts or whether it is spread evenly across the corpus. We just
can't trust or check the LLM data in that way.
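With downloadable frequency data, wildcard queries like spri* or *istic reduce to pattern matching plus a sort by frequency. A minimal sketch with an invented mini frequency list (the counts are not real corpus figures):

```python
import fnmatch

# Invented mini frequency list; real lists cover tens of thousands of words
freq_list = {"spring": 9500, "sprint": 2100, "sprinkle": 800,
             "artistic": 4200, "realistic": 6100, "holistic": 1500,
             "unbreakable": 700, "daybreak": 300}

def wildcard_query(pattern, freq):
    """Words matching a glob pattern, most frequent first."""
    hits = fnmatch.filter(freq, pattern)
    return sorted(hits, key=freq.get, reverse=True)

print(wildcard_query("spri*", freq_list))   # spring, sprint, sprinkle
print(wildcard_query("*istic", freq_list))  # realistic, artistic, holistic
```

Every item in the result can be traced back to a count in the list, which is precisely the checkability that the generated LLM lists lack.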
2.7 KWIC lines
One of the "bread and butter" tools of corpora
is the ability to see
Keyword in Context
(KWIC) lines for any word or phrase, and then sort the lines to find
interesting patterns in which the word or phrase occurs. Similar searches in
GPT
and Gemini are much
more "clunky" than with the corpora, and of course the KWIC lines are just made
up by the LLM. They don't represent actual phrases from the language, as they do
with corpora.
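A KWIC display is straightforward to sketch: find each occurrence of a word and show a fixed window of context on either side. The sample sentence is invented.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) context tuples for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text
text = "the rusty hinge creaked as the rusty gate swung open".split()
for left, kw, right in kwic(text, "rusty"):
    print(f"{left:>25} | {kw} | {right}")
```

The crucial difference from an LLM's "KWIC lines" is that every line here is an attested occurrence in the underlying text, not a plausible-looking invention.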
2.8 Integrated / linked data and searches
Via the LLM interface, you enter a question at
the prompt and then you see the answer, with little or no links to other related
information. But one of the main strengths of a well-planned and user-oriented
corpus interface is the ability of a user to move from one word or one display to another,
and thus follow a "chain" of related words and phrases. To give just a few (from
among many possible) examples from English-Corpora.org, users can:
- See detailed information on each of the top 60,000 words (lemmas) in the corpora, including 1) detailed frequency information, including frequency by genre 2) definitions, including links to external definitions and etymology 3) links to images 4) links to pronunciation and external videos 5) links to translations from several external websites 6) synonyms 7) more specific and more general words from WordNet 8) other words where the word is used as part of the definition for those words 9) morphologically-related words 10) collocates 11) related "topics", co-occurring words found anywhere in the text and not just within 3-4 words 12) the most common 2, 3, and 4 word clusters 13) texts or websites where the word is very frequent, to create "Virtual Corpora" for the word, and 14) re-sortable concordance (KWIC) lines (more: PDF, video). Many of those links (e.g. topics, collocates, clusters, or KWIC entries) lead to extremely detailed pages that will also link to other words and phrases.
- Do a search for words, phrases, or collocates, and then find detailed information (see above) on any word in the results
- See a Keyword in Context (KWIC) display for a word or phrase or construction, and then get detailed information on any of the words (see above) in the KWIC display (more: PDF, video)
- Browse through the top 60,000 words in the corpus (by word form, frequency, definition, and even pronunciation), and then see detailed information (see above) on any word in the list (more: PDF, video)
- Enter an entire text (for example the contents of an online newspaper or magazine article), and then see related phrases for any phrase in the text, or detailed information (see above) on any word in the text (more: PDF, video)
- And all of this information includes links to other words and phrases. So you could move from one word to another to another in the collocates lists, and thereby explore a "forest" of related words
Notes for AI/ML researchers
With such powerful LLMs at their disposal, AI
and ML (machine learning) researchers might feel that corpora and corpus
linguistics have little if anything to offer, and that corpora are a "quaint
throwback" to the 1990s or before. However, robust, well-developed corpora may still have a role to
play in terms of LLM development, especially in terms of the issues raised in
2.1-2.7 above. Perhaps most importantly, corpus data is actually verifiable (see
2.1 above), and you can download the actual
corpus data, as well as derived
word frequency,
collocates, and
ngrams data. The corpus data may also be
helpful for tool use and Retrieval Augmented Generation (RAG). For a short (and hopefully not too
simplistic) overview, you might take a look at these quick summaries from
GPT
and Gemini about how
corpora can inform LLMs. Because I am not an AI/ML researcher, I would be
interested in other areas (fine tuning?, distillation?) where you think that
actual corpus data might be helpful.
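As a toy illustration of the tool-use idea: rather than letting a model guess a frequency, the question can be routed through a lookup against real corpus data first. The frequency table, function names, and dispatch logic here are all hypothetical, not any actual LLM API.

```python
# Hypothetical corpus frequency table; the counts are invented
corpus_freq = {"cupboard": 3120, "closet": 8940}

def answer_with_tool(word):
    """Prefer verifiable corpus data; fall back to a hedged estimate."""
    if word in corpus_freq:
        return f"{word}: {corpus_freq[word]} tokens (corpus lookup)"
    return f"{word}: no corpus data available; model estimate only"

print(answer_with_tool("cupboard"))
print(answer_with_tool("kombucha"))
```

The design point is simply that answers backed by a lookup are verifiable and falsifiable (Section 2.1), while pure model output is not.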
Notes re corpus linguistics
Corpus linguistics is a field where tradition is
important. Many researchers are still using the same corpora that they used
20-30 years ago. In addition, it is almost "taboo" to use corpora and datasets
that are not carefully, lovingly crafted according to
exacting standards, and even in the 2020s, great care is taken to create new
corpora that look just like corpora from the early 1990s. It is therefore unlikely that many
corpus linguists will seriously consider how they can incorporate insights from
LLMs into their research; it's just too far outside of their comfort zone. But I
argue that this may lead others to view their field and many of their
corpora as increasingly irrelevant.