CORPORA AND AI / LLMs: Some notes on corpus linguistics
Mark Davies / September 2025
Although I am
now "retired",
I was a corpus linguist for more than 35 years, and I
published widely and gave
keynote talks at many of the top conferences in the field. I have
also created (pretty much by myself) what are the most widely used
online corpora, which
are used by thousands of
researchers every year,
and which result in more
than 1,000
academic publications each year. In addition, these
corpora (especially their full-text and
word frequency data) have been
used by many large tech companies in the development of many products
-- some of which you probably use each day.
But I have always
been concerned about how "insular" and "provincial" corpus linguists can be, in
terms of not paying attention to developments and progress outside of this very
narrow field of research. I sometimes worry that because of this attitude, corpus
linguistics risks becoming somewhat irrelevant, and that it will in the future be viewed as a
"quaint, obsolete, inward-looking" field of research, by researchers in other fields that have
taken its place.
As in most academic fields, in corpus linguistics there is an influential "old guard"
-- some of them in prominent positions -- who are
quite opposed to the use of new resources and approaches. Sometimes they feel that these
resources challenge the corpora that they
carefully and lovingly created 10, 20, or even 30 years ago. Sometimes they are
just "traditionalists" -- people who think that others should be doing things
the same way that they did them back when they were graduate students in the
1980s or 1990s. (And it's not always just the older researchers -- some younger
researchers have learned that they can sycophantically advance their career by playing
to the
nostalgia of the older, more influential researchers.)
The following is some anecdotal evidence
for this concern about the state of the field, especially in terms of its
opposition to new resources like AI / LLMs. (I'm going to try to exercise restraint by
not including links that would identify specific people and venues. But "if you
know, you know" :-)
1. In Summer 2025, a well-known corpus linguistics program put on a series of webinars
related to "corpora and AI". They
mainly focused on old tropes about the limitations of AI ("don't even think about using
AI / LLMs until they've completely solved the following eight problems..."), and
they trotted out the same non-AI-enabled tools that they've been pushing for several
years now, as "alternatives to (untrustworthy) AI". These supposedly
cutting-edge webinars on "corpora and AI" proposed nothing new or helpful, in
terms of AI-powered tools that could actually be used to analyze corpora.
2. At a top
Corpus Linguistics conference in Summer 2025, the well-known
creator of a
widely-used tool for corpus retrieval (which of course has no AI capabilities,
and probably never will)
wandered around from session to session, arguing incessantly with anyone and
everyone who claimed that insights from AI could be integrated with corpus data.
It's hard for me to imagine this happening in 2025 at a conference for molecular biology
or astrophysics or neuroscience or climate science, or any number of other
"data-oriented" fields.
3. At another corpus linguistics conference in
Summer 2025, a keynote speaker said that it would be a serious mistake to put corpora with AI /
LLMs into the hands of students, because it would stunt the development of their "analytic capabilities". The only way for them to really learn to
analyze corpus data was the same way that it was done back in 2015 and 2005 and 1995 and
1985 -- by
painstaking, manual analysis of hundreds of lines of KWIC / concordance lines.
(But of course many language learners don't really want to be "junior corpus
linguists" anyway -- they just want tools to help them learn another language.)
4. A recent book on designing and evaluating
language corpora (by three top researchers in the field, and from a top academic
publisher) argued throughout the book that 1) we really need to keep creating
corpora the same way that we did in the early 1990s, and 2) all that really matters in
a corpus are the texts in the corpus. Anything that makes the corpus more
useful for the end user -- architecture, or interface (and of course anything
like AI) -- isn't really worth discussing (at least not in their book).
5. I have been arguing for 10-15 years now (see
Davies 2014, 2015, 2016 below) that very large text archives (like Google
Books) can yield very insightful data, even if those text
collections are not a perfectly created corpus. But if you look at the last 4-5
years of the top journals in corpus linguistics, you will see how few (if any)
of the articles deal with large text archives like this, or even very large
corpora like those from Sketch Engine.
6. To give a concrete example of #5, one of my sons
is finishing up a PhD in Natural Language Processing / Machine Learning / AI. He was a
researcher on a project
that used a 100 billion
word historical corpus of American newspapers and magazines to look at
discussions of social issues over the last
150-200 years. As I compared their data to data from the 475
million word
COHA corpus (lovingly and
painstakingly created by me in 2010, but just 1/200th the size of their
"corpus") -- I was absolutely blown away by how rich and insightful their data was,
and how effortlessly they extracted the data using advanced NLP / ML techniques
(such as neural networks). But crucially, their data came from collections of
texts that violate many of the
sacrosanct guidelines that are emphasized so much in corpus linguistics. Is
anyone in the narrow field of corpus linguistics paying attention to articles
like this?
7. On the other hand, if you look at recent articles in the top journals
in corpus linguistics, you'll see how many articles there deal with
"field-internal" debates (i.e. "navel gazing"), such as exactly what association measure we should use
to find collocates, or how a different dispersion measure can produce
slightly better frequency lists. But many of these researchers are just "big
fish in a small pond". They really don't understand that the ground has
shifted under their feet, and that researchers in Natural Language Processing /
Machine Learning / AI are using much more sophisticated methods (such as advanced
neural networks) to extract data from trillions of words of text.
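To make the "association measure" debate above a bit more concrete: one of the most common measures is pointwise Mutual Information (PMI), which compares how often a node word and a collocate co-occur against how often they would co-occur by chance. A minimal Python sketch follows; the words and counts are purely made-up, for illustration only:

```python
import math

def pmi(node_count, colloc_count, pair_count, corpus_size):
    """Pointwise Mutual Information for a (node, collocate) pair:
    log2( P(node, collocate) / (P(node) * P(collocate)) )."""
    p_pair = pair_count / corpus_size
    p_node = node_count / corpus_size
    p_colloc = colloc_count / corpus_size
    return math.log2(p_pair / (p_node * p_colloc))

# Hypothetical counts: "strong" occurs 1,000 times, "tea" 500 times, and
# the pair "strong tea" 100 times, in a 1,000,000-token corpus.
score = pmi(1_000, 500, 100, 1_000_000)
print(round(score, 2))  # PMI of about 7.64
```

A high PMI like this flags "tea" as a strong collocate of "strong"; much of the field-internal debate is over which of these measures (PMI, log-likelihood, and so on) best balances raw frequency against exclusivity.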
And these are just the anecdotes that I am
personally familiar with. Now that I am (technically) "retired", I only rarely
attend conferences or participate in reviews of conference submissions and journal
articles. As a result, I'm sure that I'm missing out on many, many other such anecdotes.
My overall concern is that the entire field of
corpus linguistics may end up becoming
increasingly irrelevant,
and being ignored by researchers in what could and should be allied fields. For
example, the
CORPORA listserv uses the name "corpora", but the vast majority of the
posts there are now from the fields of Natural Language Processing and Computational
Linguistics, and the vast majority of their conference postings are related to
AI. But NLP researchers at CORPORA rarely mention any work from corpus
linguistics, they seem to be blithely unaware of what corpus linguists are doing, and
there is very little interaction between the two fields. (I've been following
the listserv for more than 20 years, and it didn't use to be this way. Once
upon a time, corpus linguists were actually the majority there, and there were actual conversations between the two fields.)
So where to go from here? It seems like we are at
the proverbial "fork in the road" with AI and corpora right now. To
make the issue a bit more interesting and relatable, I had
ChatGPT,
Gemini, and
Claude create stories (which they all did in just 1-2 seconds!) about a
carriage
maker in the early 1900s who refuses to adapt in the face of the
"newfangled automobiles", and who slips into irrelevance as a result
of that. For those of you in
the field of corpus linguistics, how many similarities do you see between those
stories and what you see happening in the field right now?
To elaborate a bit on that analogy, of course there were problems with automobiles
(noise, pollution, safety, use of fossil fuels, etc). But wandering around town,
mumbling about "how to
make a better horse-drawn carriage" in the early 1900s was undoubtedly the wrong approach.
In retrospect, the best course of action would have been to take the knowledge
and skills of a carriage maker, and to find a way to make those skills
applicable in the new world of "horseless
carriages".
My hope is that corpus linguists will choose to stay
relevant in this "brave new world" of LLMs, and that we can more fully
embrace the use of AI / LLMs, along with actual corpus data -- in our
teaching and research. The future of our field depends on this.
References
-
Davies, Mark (2014) "Making Google Books
n-grams useful for a wide range of research on language change".
International Journal of Corpus Linguistics 19 (3): 401-16.
-
Davies, Mark (2015) "Corpora: An
Introduction". In Cambridge Handbook of English Corpus Linguistics,
eds. Douglas Biber and Randi Reppen. Cambridge: Cambridge University Press.
11-31.
-
Davies, Mark and Don Chapman. (2016) "The
Effect of Representativeness and Size in Historical Corpora: An Empirical
Study of Changes in Lexical Frequency." In Studies in the History of the
English Language VII: Generalizing vs. particularizing methodologies in
historical linguistic analysis, eds. Don Chapman, Colette Moore, and
Miranda Wilcox. Berlin: De Gruyter / Mouton. 131-50.
But ten years later, many corpus linguists are
still living 20 or 30 years in the past. There are very few articles in the main
journals in corpus linguistics (such as the
International Journal of Corpus
Linguistics or
Corpora or the
ICAME Journal)
that use very large datasets (like Google Books ngrams) or
anything like Large Language Models. The same is mostly true for the top
conferences in corpus linguistics, such as
Corpus Linguistics (2023). In addition, even
recent books urge corpus linguists to "hold fast" to the exacting standards
of 30-40 years ago (when corpora were much, much smaller than today), and even top researchers hearken back to
corpora from that
presumed "golden age"
of corpus linguistics as they create
new corpora.