English-Corpora.org


CORPORA AND AI / LLMs: Some notes on corpus linguistics  

Mark Davies / September 2025
 

Although I am now "retired", I was a corpus linguist for more than 35 years, and I published widely and gave keynote talks at many of the top conferences in the field. I have also created (pretty much by myself) what are the most widely used online corpora, which are used by thousands of researchers every year, and which result in more than 1,000 academic publications each year. In addition, these corpora (especially their full-text and word frequency data) have been used by many large tech companies in the development of many products -- some of which you probably use every day.

But I have always been concerned about how "insular" and "provincial" corpus linguists can be, in terms of not paying attention to developments and progress outside of this very narrow field of research. I sometimes worry that because of this attitude, corpus linguistics risks becoming irrelevant, and that it will one day be viewed as a "quaint, obsolete, inward-looking" field of research by researchers in other fields that have taken its place.



As in most academic fields, in corpus linguistics there is an influential "old guard" -- some of them in prominent positions -- who are quite opposed to the use of new resources and approaches. Sometimes they feel that these resources challenge the corpora that they carefully and lovingly created 10, 20, or even 30 years ago. Sometimes they are just "traditionalists" -- people who think that others should be doing things the same way that they did them back when they were graduate students in the 1980s or 1990s. (And it's not always just the older researchers -- some younger researchers have learned that they can sycophantically advance their careers by playing to the nostalgia of the older, more influential researchers.)

The following is some anecdotal evidence for this concern about the state of the field, especially in terms of its opposition to new resources like AI / LLMs. (I'm going to try to exercise restraint by not including links that would identify specific people and venues. But "if you know, you know" :-)

1. In Summer 2025, a well-known corpus linguistics program put on a series of webinars related to "corpora and AI". They mainly focused on old tropes about the limitations of AI ("don't even think about using AI / LLMs until they've completely solved the following eight problems..."), and they trotted out the same non-AI-enabled tools that they've been pushing for several years now, as "alternatives to (untrustworthy) AI". These supposedly cutting-edge webinars on "corpora and AI" proposed nothing new or helpful, in terms of AI-powered tools that could actually be used to analyze corpora.

2. At a top Corpus Linguistics conference in Summer 2025, the well-known creator of a widely-used tool for corpus retrieval (which of course has no AI capabilities, and probably never will) wandered from session to session, arguing incessantly with anyone and everyone who claimed that insights from AI could be integrated with corpus data. It's hard for me to imagine this happening in 2025 at a conference for molecular biology or astrophysics or neuroscience or climate science, or any number of other "data-oriented" fields.

3. At another corpus linguistics conference in Summer 2025, a keynote speaker said that it would be a serious mistake to put corpora with AI / LLMs into the hands of students, because it would stunt the development of their "analytic capabilities". The only way for them to really learn to analyze corpus data was the same way that it was done back in 2015 and 2005 and 1995 and 1985 -- by painstaking, manual analysis of hundreds of lines of KWIC / concordance lines. (But of course many language learners don't really want to be "junior corpus linguists" anyway -- they just want tools to help them learn another language.)

4. A recent book on designing and evaluating language corpora (by three top researchers in the field, and from a top academic publisher) argued throughout the book that 1) we really need to keep creating corpora the same way that we did in the early 1990s, and 2) all that really matters in a corpus are the texts in the corpus. Anything that makes the corpus more useful for the end user -- architecture, or interface (and of course anything like AI) -- isn't really worth discussing (at least not in their book).

5. I have been arguing for 10-15 years now (see Davies 2014, 2015, 2016 below) that very large text archives (like Google Books) can yield very insightful data, even if those text collections are not perfectly constructed corpora. But if you look at the last 4-5 years of the top journals in corpus linguistics, you will see how few (if any) of the articles deal with large text archives like this, or even with very large corpora like those from Sketch Engine.

6. To give a concrete example of #5, one of my sons is finishing up a PhD in Natural Language Processing / Machine Learning / AI. He was a researcher on a project that used a 100 billion word historical corpus of American newspapers and magazines to look at discussions of social issues over the last 150-200 years. When I compared their data to data from the 475 million word COHA corpus (lovingly and painstakingly created by me in 2010, but just 1/200th the size of their "corpus"), I was absolutely blown away by how rich and insightful their data was, and how effortlessly they extracted it using advanced NLP / ML techniques (such as neural networks). But crucially, their data came from collections of texts that violate many of the sacrosanct guidelines that are emphasized so much in corpus linguistics. Is anyone in the narrow field of corpus linguistics paying attention to articles like this?

7. On the other hand, look at recent articles in the top journals in corpus linguistics, and you'll see how many of them deal with "field-internal" debates (i.e. "navel gazing"), such as exactly which association measure we should use to find collocates, or how a different dispersion measure can produce slightly better frequency lists. But many of these researchers are just "big fish in a small pond". They don't seem to understand that the ground has shifted under their feet, and that researchers in Natural Language Processing / Machine Learning / AI are using much more sophisticated methods (such as advanced neural networks) to extract data from trillions of words of text.

And these are just the anecdotes that I am personally familiar with. Now that I am (technically) "retired", I only rarely attend conferences or participate in reviews of conference submissions and journal articles. As a result, I'm sure that I'm missing out on many, many other such anecdotes.

My overall concern is that the entire field of corpus linguistics may end up becoming increasingly irrelevant, and being ignored by researchers in what could and should be allied fields. For example, the CORPORA listserv uses the name "corpora", but the vast majority of the posts there are now from the fields of Natural Language Processing and Computational Linguistics, and the vast majority of their conference postings are related to AI. But NLP researchers on CORPORA rarely mention any work from corpus linguistics, they seem to be blithely unaware of what corpus linguists are doing, and there is very little interaction between the two fields. (I've been following the listserv for more than 20 years, and it wasn't always this way. Once upon a time, corpus linguists were actually the majority there, and there were real conversations between the two fields.)



So where to go from here? It seems like we are at the proverbial "fork in the road" with AI and corpora right now. To make the issue a bit more interesting and relatable, I had ChatGPT, Gemini, and Claude create stories (which they all did in just 1-2 seconds!) about a carriage maker in the early 1900s who refuses to adapt in the face of the "newfangled automobiles", and who slips into irrelevance as a result of that. For those of you in the field of corpus linguistics, how many similarities do you see between those stories and what you see happening in the field right now?

To elaborate a bit on that analogy, of course there were problems with automobiles (noise, pollution, safety, use of fossil fuels, etc). But wandering around town, mumbling about "how to make a better horse-drawn carriage" in the early 1900s was undoubtedly the wrong approach. In retrospect, the best course of action would have been to take the knowledge and skills of a carriage maker, and to find a way to make those skills applicable in the new world of "horseless carriages".

My hope is that corpus linguists will choose to stay relevant in this "brave new world" of LLMs, and that we can more fully embrace the use of AI / LLMs, along with actual corpus data, in our teaching and research. The future of our field depends on this.
 



References

  • Davies, Mark (2014) "Making Google Books n-grams useful for a wide range of research on language change". International Journal of Corpus Linguistics 19 (3): 401-16.

  • Davies, Mark (2015) "Corpora: An Introduction". In Cambridge Handbook of English Corpus Linguistics, eds. Douglas Biber and Randi Reppen. Cambridge: Cambridge University Press. 11-31.

  • Davies, Mark and Don Chapman. (2016) "The Effect of Representativeness and Size in Historical Corpora: An Empirical Study of Changes in Lexical Frequency." In Studies in the History of the English Language VII: Generalizing vs. particularizing methodologies in historical linguistic analysis, eds. Don Chapman, Colette Moore, and Miranda Wilcox. Berlin: De Gruyter / Mouton. 131-50.

But ten years later, many corpus linguists are still living 20 or 30 years in the past. Very few articles in the main journals in corpus linguistics (such as the International Journal of Corpus Linguistics, Corpora, or the ICAME Journal) use very large datasets (like Google Books n-grams) or anything like Large Language Models. The same is mostly true of the top conferences in corpus linguistics, such as Corpus Linguistics (2023). In addition, even recent books urge corpus linguists to "hold fast" to the exacting standards of 30-40 years ago (when corpora were much, much smaller than today), and even top researchers hearken back to corpora from that presumed "golden age" of corpus linguistics as they create new corpora *.