English-Corpora.org

CORPORA AND AI / LLMs: Some notes on corpus linguistics (General overview)

Mark Davies / March 2025

Although I am now "retired", I worked in the field of corpus linguistics for more than 35 years, and I published widely and gave keynote talks at many of the top conferences in the field. But I have always been concerned about how "insular" and "provincial" corpus linguists can be, in terms of not paying attention to developments and progress outside of this very narrow field of research. I sometimes worry that because of this attitude, corpus linguistics risks becoming somewhat irrelevant, and will in the future be viewed as a "quaint, obsolete" field of research by researchers in other fields that have passed corpus linguistics by. That is why I have taken the time to write this series of "white papers" on how corpus linguists and language teachers can use LLMs to supplement and enhance their research and teaching. The goal is not to have LLMs replace corpora, but to have them used in conjunction with corpora.


By way of background, I moved from the field of Spanish and Portuguese historical syntax to the field of corpus linguistics (and started working mostly with English data) when I moved from Illinois State University to Brigham Young University in 2003. It seemed like I then had to spend ten years or more trying to get people to use new corpora (some of which I had created), rather than simply rehashing the same old topics (modal and auxiliary verbs, and auxiliary verbs and modals) with the same corpora they had been using for 20-30 years.

When I was asked to write the lead chapter in the Cambridge Handbook of English Corpus Linguistics (Davies 2015), I suggested that corpus linguists might want to consider using very large datasets in addition to their small, lovingly created "boutique" corpora, and I predicted that even larger and more powerful tools might soon be available. In other publications (Davies 2014, Davies and Chapman 2016) I made similar suggestions, and I tried to get other corpus linguists to pay attention to very large datasets like the Google Books ngrams.

But ten years later, many corpus linguists are still living 20 or 30 years in the past. There are very few articles in the main journals in corpus linguistics (such as the International Journal of Corpus Linguistics, Corpora, or the ICAME Journal) that use very large datasets (like the Google Books ngrams) or anything like Large Language Models. The same is mostly true for the top conferences in corpus linguistics, such as Corpus Linguistics (2023). In addition, even recent books urge corpus linguists to "hold fast" to the exacting standards of 30-40 years ago (when corpora were much, much smaller than they are today), and even top researchers hearken back to corpora from that presumed "golden age" of corpus linguistics as they create new corpora *.

Even with this general reluctance to change, however, there might finally be some light on the horizon. At the last academic conference that I attended before I retired in 2020, I was pleased to see that "newer" corpora like those from English-Corpora.org were being used as much as or more than the older corpora from 20-30 years before. And in 2025 (as this page is being written), the corpora from English-Corpora.org are the most widely used corpora of English among teachers and students.

As far as LLMs and similar resources are concerned, at the most recent ICAME conference (Vigo, 2024) there were actually two or three papers (out of the roughly 150 papers at the conference) that referenced ChatGPT. And at the most recent Teaching and Language Corpora conference (2024), there were a handful of papers (out of the 100 or so at the conference) that mentioned LLMs as well. (My experience is that people who use corpora for teaching seem to be more open to new technologies than "old-school" researchers of English.)


So if some corpus linguists are now using new corpora, large datasets, and even LLMs (to a very limited extent), then what is the problem? The problem is that there is still a very influential "old guard" contingent of corpus linguists -- some of them in prominent positions -- who are quite opposed to the use of these new resources. Some of them actively oppose the new resources, which they feel challenge their existing corpora (see #6 here). But many others are content to simply ignore these new resources, with the belief that if they ignore them long enough, they will somehow go away.

The downside of this "head in the sand" approach, as mentioned, is that the entire field may end up becoming increasingly irrelevant and being ignored by researchers in what could and should be allied fields. For example, the CORPORA listserv uses the name "corpora", but the majority of the posts there are now from the fields of Natural Language Processing and Computational Linguistics. NLP researchers on CORPORA rarely mention any work from corpus linguistics, they seem to be blithely unaware of what corpus linguists are doing, and there is very little interaction between the two fields. (I've been following the listserv for more than 20 years, and it wasn't always this way. Once upon a time, corpus linguists were actually the majority there, and there were actual conversations between the two fields.)

So where to go from here? Just for fun, I had GPT create two stories about people who were in fields that were at risk of becoming irrelevant in the face of large, disruptive changes. One story is about a carriage maker in the early 1900s who refuses to adapt in the face of the "newfangled automobiles", and who slips into irrelevance. The other is about the owner of an Italian restaurant (scroll down a bit in the link), who is initially reluctant to believe that the Italian food from "big box" retailers (think LLMs) can in any way compete with the lovingly and carefully crafted food at his cozy Italian restaurant (think small, beloved 100-million-word corpora). But he eventually realizes that he can incorporate key insights from the "big box" retailers into what he offers at his restaurant, and he thus has "the best of both worlds".

My hope is that corpus linguists choose to stay relevant in this "brave new world" of LLMs, and that they begin to use insights from LLMs -- along with actual corpus data -- in their teaching and research. Hopefully these white papers will help in that effort.


References

  • Davies, Mark (2014) "Making Google Books n-grams useful for a wide range of research on language change". International Journal of Corpus Linguistics 19 (3): 401-16.

  • Davies, Mark (2015) "Corpora: An Introduction". In Cambridge Handbook of English Corpus Linguistics, eds. Douglas Biber and Randi Reppen. Cambridge: Cambridge University Press. 11-31.

  • Davies, Mark and Don Chapman (2016) "The Effect of Representativeness and Size in Historical Corpora: An Empirical Study of Changes in Lexical Frequency". In Studies in the History of the English Language VII: Generalizing vs. particularizing methodologies in historical linguistic analysis, eds. Don Chapman, Colette Moore, and Miranda Wilcox. Berlin: De Gruyter / Mouton. 131-50.