CORPORA AND AI / LLMs: Some notes on corpus linguistics
Mark Davies / September 2025
Although I am
now "retired",
I was a corpus linguist for more than 35 years, and I
published widely and gave
keynote talks at many of the top conferences in the field. I have
also created (pretty much by myself) what are the most widely used
online corpora, which
are used by thousands of
researchers every year,
and which result in more
than 1,000
academic publications each year. In addition, these
corpora (especially their full-text and
word frequency data) have been
used by many large tech companies in the development of many products
-- some of which you probably use each day.
But I have always
been concerned about how "insular" and "provincial" corpus linguists can be, in
terms of not paying attention to developments and progress outside of this very
narrow field of research. I sometimes worry that because of this attitude, corpus
linguistics risks becoming somewhat irrelevant, and that it will in the future be viewed as a
"quaint, obsolete, inward-looking" field of research, by researchers in other fields that have
taken its place.
As in most academic fields, in corpus linguistics there is an influential "old guard"
-- some of them in prominent positions -- who are
quite opposed to the use of new resources and approaches. Sometimes they feel that these
resources challenge the corpora that they
carefully and lovingly created 10, 20, or even 30 years ago. Sometimes they are
just "traditionalists" -- people who think that others should be doing things
the same way that they did them back when they were graduate students in the
1980s or 1990s. (And it's not always just the older researchers -- some younger
researchers have learned that they can sycophantically advance their career by playing
to the
nostalgia of the older, more influential researchers.)
The following is some anecdotal evidence
for this concern about the state of the field, especially in terms of its
opposition to new resources like AI / LLMs. (I'm going to try to exercise restraint by
not including links that would identify specific people and venues. But "if you
know, you know" :-)
1. In Summer 2025, a well-known corpus linguistics program put on a series of webinars
related to "corpora and AI". They
mainly focused on old tropes about the limitations of AI ("don't even think about using
AI / LLMs until they've completely solved the following eight problems..."), and
they trotted out the same non-AI-enabled tools that they've been pushing for several
years now, as "alternatives to (untrustworthy) AI". These supposedly
cutting-edge webinars on "corpora and AI" proposed nothing new or helpful, in
terms of AI-powered tools that could actually be used to analyze corpora.
2. At a top
Corpus Linguistics conference in Summer 2025, the well-known
creator of a
widely-used tool for corpus retrieval (which of course has no AI capabilities,
and probably never will)
wandered around from session to session, arguing incessantly with anyone and
everyone who claimed that insights from AI could be integrated with corpus data.
It's hard for me to imagine this happening in 2025 at a conference for molecular biology
or astrophysics or neuroscience or climate science, or any number of other
"data-oriented" fields.
3. At another corpus linguistics conference in
Summer 2025, a keynote speaker said that it would be a serious mistake to put corpora with AI /
LLMs into the hands of students, because it would stunt the development of their "analytic capabilities". The only way for them to really learn to
analyze corpus data was the same way that it was done back in 2015 and 2005 and 1995 and
1985 -- by
painstaking, manual analysis of hundreds of lines of KWIC / concordance lines.
(But of course many language learners don't really want to be "junior corpus
linguists" anyway -- they just want tools to help them learn another language.)
4. A recent book on designing and evaluating
language corpora (by three top researchers in the field, and from a top academic
publisher) argued throughout the book that 1) we really need to keep creating
corpora the same way that we did in the early 1990s, and 2) all that really matters in
a corpus are the texts in the corpus. Anything that makes the corpus more
useful for the end user -- architecture, or interface (and of course anything
like AI) -- isn't really worth discussing (at least not in their book).
5. I have been arguing for 10-15 years now (see
Davies 2014, 2015, 2016 below) that very large text archives (like Google
Books) can yield very insightful data, even if those text
collections are not a perfectly created corpus. But if you look at the last 4-5
years of the top journals in corpus linguistics, you will see how few (if any)
of the articles deal with large text archives like this, or even very large
corpora like those from Sketch Engine.
6. To give a concrete example of #5, one of my sons
is finishing up a PhD in Natural Language Processing / Machine Learning / AI. He was a
researcher on a project
that used a 100 billion
word historical corpus of American newspapers and magazines to look at
discussions of social issues over the last
150-200 years. As I compared their data to data from the 475
million word
COHA corpus (lovingly and
painstakingly created by me in 2010, but just 1/200th the size of their
"corpus") -- I was absolutely blown away by how rich and insightful their data was,
and how effortlessly they extracted the data using advanced NLP / ML techniques
(such as neural networks). But crucially, their data came from collections of
texts that violate many of the
sacrosanct guidelines that are emphasized so much in corpus linguistics. Is
anyone in the narrow field of corpus linguistics paying attention to articles
like this?
7. On the other hand, if you look at recent articles in the top journals
in corpus linguistics, you'll see how many articles there deal with
"field-internal" debates (i.e. "navel gazing"), such as exactly what association measure we should use
to find collocates, or how a different dispersion measure can produce
slightly better frequency lists. But many of these researchers are just "big
fish in a small pond". They really don't understand that the ground has
shifted under their feet, and that researchers in Natural Language Processing /
Machine Learning / AI are using much more sophisticated methods (such as advanced
neural networks) to extract data from trillions of words of text.
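To make the "association measure" debate above a bit more concrete: one of the most common measures is pointwise Mutual Information (PMI), which compares how often a node word and a collocate co-occur against how often they would co-occur by chance. A minimal Python sketch follows; the words and counts are purely made-up, for illustration only:

```python
import math

def pmi(node_count, colloc_count, pair_count, corpus_size):
    """Pointwise Mutual Information for a (node, collocate) pair:
    log2( P(node, collocate) / (P(node) * P(collocate)) )."""
    p_pair = pair_count / corpus_size
    p_node = node_count / corpus_size
    p_colloc = colloc_count / corpus_size
    return math.log2(p_pair / (p_node * p_colloc))

# Hypothetical counts: "strong" occurs 1,000 times, "tea" 500 times, and
# the pair "strong tea" 100 times, in a 1,000,000-token corpus.
score = pmi(1_000, 500, 100, 1_000_000)
print(round(score, 2))  # PMI of about 7.64
```

A high PMI like this flags "tea" as a strong collocate of "strong"; much of the field-internal debate is over which of these measures (PMI, log-likelihood, and so on) best balances raw frequency against exclusivity.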
And these are just the anecdotes that I am
personally familiar with. Now that I am (technically) "retired", I only rarely
attend conferences or participate in reviews of conference submissions and journal
articles. As a result, I'm sure that I'm missing out on many, many other such anecdotes.
My overall concern is that the entire field of
corpus linguistics may end up becoming
increasingly irrelevant,
and being ignored by researchers in what could and should be allied fields. For
example, the
CORPORA listserv uses the name "corpora", but the vast majority of the
posts there are now from the fields of Natural Language Processing and Computational
Linguistics, and the vast majority of their conference postings are related to
AI. But NLP researchers at CORPORA rarely mention any work from corpus
linguistics, they seem to be blithely unaware of what corpus linguists are doing, and
there is very little interaction between the two fields. (I've been following
the listserv for more than 20 years, and it didn't use to be this way. Once
upon a time, corpus linguists were actually the majority there, and there were actual conversations between the two fields.)
So where to go from here? It seems like we are at
the proverbial "fork in the road" with AI and corpora right now. To
make the issue a bit more interesting and relatable, I had
ChatGPT,
Gemini, and
Claude create stories (which they all did in just 1-2 seconds!) about a
carriage
maker in the early 1900s who refuses to adapt in the face of the
"newfangled automobiles", and who slips into irrelevance as a result
of that. For those of you in
the field of corpus linguistics, how many similarities do you see between those
stories and what you see happening in the field right now?
To elaborate a bit on that analogy, of course there were problems with automobiles
(noise, pollution, safety, use of fossil fuels, etc). But wandering around town,
mumbling about "how to
make a better horse-drawn carriage" in the early 1900s was undoubtedly the wrong approach.
In retrospect, the best course of action would have been to take the knowledge
and skills of a carriage maker, and to find a way to make those skills
applicable in the new world of "horseless
carriages".
My hope is that corpus linguists will choose to stay
relevant in this "brave new world" of LLMs, and that we can more fully
embrace the use of AI / LLMs, along with actual corpus data -- in our
teaching and research. The future of our field depends on this.
References
-
Davies, Mark (2014) "Making Google Books
n-grams useful for a wide range of research on language change".
International Journal of Corpus Linguistics 19 (3): 401-16.
-
Davies, Mark (2015) "Corpora: An
Introduction". In Cambridge Handbook of English Corpus Linguistics,
eds. Douglas Biber and Randi Reppen. Cambridge: Cambridge University Press.
11-31.
-
Davies, Mark and Don Chapman. (2016) "The
Effect of Representativeness and Size in Historical Corpora: An Empirical
Study of Changes in Lexical Frequency." In Studies in the History of the
English Language VII: Generalizing vs. particularizing methodologies in
historical linguistic analysis, eds. Don Chapman, Colette Moore, and
Miranda Wilcox. Berlin: De Gruyter / Mouton. 131-50.
But ten years later, many corpus linguists are
still living 20 or 30 years in the past. There are very few articles in the main
journals in corpus linguistics (such as the
International Journal of Corpus
Linguistics or
Corpora or the
ICAME Journal)
that use very large datasets (like Google Books ngrams) or
anything like Large Language Models. The same is mostly true for the top
conferences in corpus linguistics, such as
Corpus Linguistics (2023). In addition, even
recent books urge corpus linguists to "hold fast" to the exacting standards
of 30-40 years ago (when corpora were much, much smaller than today), and even top researchers hearken back to
corpora from that
presumed "golden age"
of corpus linguistics as they create
new corpora.