Mark Davies / September 2025
AI has already reshaped research in
many fields, like biology, climate science, archaeology, even law. In
each case, new tools have made it possible to see patterns that were difficult
(or impossible) to find before, and to ask questions that hadn’t even been
considered.
Corpus linguistics is another field where
researchers deal with large, complex datasets, and so it's not hard to imagine
how AI could be used to generate advancements with corpora as well -- which
would benefit researchers, teachers, and learners.
And yet here are a few recent episodes that highlight
resistance to new tools like
AI/LLMs in the field of corpus linguistics. (I’ve tried to keep these
examples anonymous.)
1. In Summer 2025, a very well-known corpus linguistics program put on a series of webinars
related to "corpora and AI". They mainly focused on clichés about the limitations of AI ("don't even think about using
AI / LLMs until they've completely solved the following eight problems..."), and
they trotted out the same non-AI-enabled tools that they've been pushing for several
years now, as "alternatives to (untrustworthy) AI". These supposedly
cutting-edge webinars on "corpora and AI" provided nothing new or helpful, in
terms of AI-powered tools that could actually be used to analyze corpora. And ...
they have more of the same scheduled for their MOOC in Autumn 2025.
2. At a top
Corpus Linguistics conference in Summer 2025, the well-known
creator of a widely-used corpus tool (which has no AI functionality, and
probably never will) wandered around from session to session, arguing
incessantly with anyone and everyone who was talking about AI and corpora --
saying that AI simply has nothing to offer, as far as the analysis of corpus
data. It's hard for me to imagine this happening in 2025 at a conference for
molecular biology or astrophysics or neuroscience or climate science, or any
number of other "data-oriented" fields, but apparently it's OK for corpus
linguistics.
3. At another corpus linguistics conference in
Summer 2025, a keynote speaker said that it would be a serious mistake to put corpora with AI /
LLMs into the hands of students, because it would stunt the development of their "analytic capabilities".
According to this researcher, the only way for them to really learn to
analyze corpus data was the same way that it was done back in 2015 or 2005 or 1995
or 1985 -- by
painstaking, manual analysis of hundreds of KWIC / concordance lines.
(But of course many language learners don't really want to be "junior corpus
linguists" anyway -- they just want tools to help them understand another language.)
4. A recent book on designing and evaluating
language corpora (by three top researchers in the field, and from a top academic
publisher) argued throughout the book that 1) we really need to keep creating
corpora the same way that we did in the early 1990s, and 2) all that really matters in
a corpus are the texts in the corpus. Anything that makes the corpus more
useful for the end user -- architecture, or interface (and of course anything
like AI) -- isn't really worth discussing (at least not in their book).
5. I have been arguing for 10-15 years now (see
Davies 2014, 2015, 2016 below) that very large text archives (like Google
Books) can yield very insightful data, even if those text
collections are not a perfectly created "corpus". But if you look at the last 4-5
years of the top journals in corpus linguistics, you will see how few (if any)
of the articles deal with large text archives like this, or even very large
corpora like those from Sketch Engine.
6. To give a concrete example of #5, one of my sons
is finishing up a PhD in Natural Language Processing / Machine Learning / AI. He was a
researcher on a project
that used a 100 billion
word historical corpus of American newspapers and magazines to look at
discussions of social issues over the last
150-200 years. As I compared their data to data from the 475
million word
COHA corpus (lovingly and
painstakingly created by me in 2010, but just 1/200th the size of their
"corpus"), I was absolutely blown away by how rich and insightful their data was,
and how effortlessly they extracted the data using advanced NLP / ML techniques
(such as neural networks). But crucially, their data came from collections of
texts that violate many of the
sacrosanct guidelines that are emphasized so much in corpus linguistics.
Is anyone in the narrow field of corpus linguistics paying attention to research
like this, which is based on extremely large text archives?
7. On the other hand, if you look at recent articles in the top journals
in corpus linguistics, you'll see how many articles deal with
"field-internal" debates (in other words "navel gazing"), such as exactly what association measure we should use
to find collocates, or how a different dispersion measure can produce
slightly better frequency lists. I suspect that many of these researchers don't
realize how much the ground has
shifted under their feet, and that researchers in Natural Language Processing /
Machine Learning / AI are using much more sophisticated methods (such as advanced
neural networks) to extract data from trillions of words of data.
And these are just the anecdotes that I am
personally familiar with. Now that I am (technically) "retired", I only rarely
attend conferences or participate in reviews of conference submissions and journal
articles. As a result, I'm sure that I'm missing out on many, many other such anecdotes.
(But please feel free to contact me
if you're aware of similar cases, and I may include them here.)
So my concern is that if the “AI skeptics”
gain much more of a foothold, corpus linguistics may be pushed to the
margins, increasingly dismissed by researchers in related fields as a
quaint, inward-looking discipline resistant to new approaches like AI.
To give just one example that has already
happened, the
CORPORA listserv uses the name "corpora", but the vast majority of the
posts there are now from the fields of Natural Language Processing and Computational
Linguistics, and the vast majority of their conference postings are related to
AI. But NLP researchers at CORPORA rarely mention any work from corpus
linguistics. They seem to be blithely unaware of what corpus linguists are doing, and
there is very little interaction between the two fields. (I've been following
the listserv for more than 20 years, and it didn't use to be this way. Once
upon a time, corpus linguists were the majority there, and there were actual conversations between the two fields.)
So where now? We’re at a genuine fork in the
road in terms of whether corpus linguistics as a field opens up to the
possibilities of AI, or whether it retreats further into its own little
parochial world. To
illustrate the problem, I asked
ChatGPT,
Gemini, and
Claude
to create a story about an early-1900s carriage maker who refuses to adapt to
automobiles -- and then slips into irrelevance. For corpus linguistics, the parallel
should be uncomfortably clear.
And to elaborate a bit on that analogy, of course there were problems with automobiles
(noise, pollution, safety, use of fossil fuels, etc). But wandering around town,
mumbling about "how to
make a better horse-drawn carriage" in the early 1900s was undoubtedly the wrong approach.
In retrospect, the best course of action would have been to take the knowledge
and skills of a carriage maker, and to find a way to make those skills
applicable in the new world of "horseless
carriages".
My hope is that corpus linguists choose relevance
-- by embracing AI/LLMs, in conjunction with corpus data -- rather than letting
"AI skeptics" have their way. And if we wait for "Planck's
Principle" to take care of things (even with a
more "gentle"
interpretation), it may be too late. We need to start
accepting change now -- the future of our field depends on it.
References
- Davies, Mark (2014) "Making Google Books n-grams useful for a wide range of
  research on language change". International Journal of Corpus Linguistics
  19 (3): 401-16.
- Davies, Mark (2015) "Corpora: An Introduction". In Cambridge Handbook of
  English Corpus Linguistics, eds. Douglas Biber and Randi Reppen. Cambridge:
  Cambridge University Press. 11-31.
- Davies, Mark and Don Chapman (2016) "The Effect of Representativeness and
  Size in Historical Corpora: An Empirical Study of Changes in Lexical
  Frequency". In Studies in the History of the English Language VII:
  Generalizing vs. particularizing methodologies in historical linguistic
  analysis, eds. Don Chapman, Colette Moore, and Miranda Wilcox. Berlin:
  De Gruyter / Mouton. 131-50.