Mark Davies / March 2025
Although I am
now "retired",
I worked in the field
of corpus linguistics for more than 35 years, and I
published widely and gave
keynote talks at many of the top conferences in the field. But I have always
been concerned about how "insular" and "provincial" corpus linguists can be in
terms of not paying attention to developments and progress outside of this very
narrow field of research. I sometimes worry that because of this attitude, corpus
linguistics risks becoming somewhat irrelevant, and will in the future be viewed as a
"quaint, obsolete" field of research by researchers in other fields that have
passed corpus linguistics by. That is why I have taken the time to write this
series of "white papers" on how corpus linguists and language teachers can use
LLMs to supplement and enhance their research and teaching. The goal is not to have LLMs replace corpora, but
to use them in conjunction with corpora.
By way of background, I moved from the field of Spanish and Portuguese
historical syntax to the field of corpus linguistics (and started working
mostly with English data) when I moved from Illinois State University to Brigham
Young University in 2003. It seemed like I had to spend ten years or more after
that trying to get people to use new
corpora (some of which I had created), rather than simply rehashing the same
old topics (modal and auxiliary verbs, and auxiliary verbs and modals) with the same corpora
they had been using for 20-30 years.
When I was asked to write the lead chapter in the
Cambridge Handbook of English Corpus Linguistics (Davies
2015), I suggested that corpus linguists might want to consider
using very large datasets in addition to their small, lovingly created "boutique" corpora, and I
predicted that even larger and more powerful tools might soon be available. In other publications (Davies
2014,
Davies and Chapman 2016) I made similar suggestions, and I tried to get other corpus
linguists to pay attention to very large datasets like
Google Books ngrams.
But ten years later, many corpus linguists are
still living 20 or 30 years in the past. There are very few articles in the main
journals in corpus linguistics (such as the
International Journal of Corpus
Linguistics or
Corpora or the
ICAME Journal)
that use very large datasets (like Google Books ngrams) or
anything like Large Language Models. The same is mostly true for the top
conferences in corpus linguistics, such as
Corpus Linguistics (2023). In addition, even
recent books urge corpus linguists to "hold fast" to the exacting standards
of 30-40 years ago (when corpora were much, much smaller than today), and even top researchers hearken back to
corpora from that
presumed "golden age"
of corpus linguistics as they create
new corpora.
Even with this general reluctance to change,
however, there might finally be some light on the horizon. At the last academic
conference that I attended before I retired in 2020, I was pleased to see
that "newer" corpora like those from
English-Corpora.org
were being used as much as or more than the older corpora from 20-30 years before. And
in 2025 (as this page is being written), English-Corpora.org is the
most widely used collection of corpora
of English among teachers and students.
As for LLMs and similar resources, at
the most recent
ICAME conference (Vigo, 2024), there were actually two or three papers
(of the roughly 150 papers at the conference) that referenced ChatGPT. And at
the most recent
Teaching and Language Corpora conference (2024), there were a handful of
papers (of the 100 or so at the conference) that mentioned LLMs as well. (My
experience is that people who use corpora for teaching seem to be more open to
new technologies than "old-school" researchers of English.)
So if some corpus linguists are now using
new corpora, large datasets, and even LLMs (to a very limited extent), then what is
the problem? The problem is that there is still a very influential "old guard"
contingent of corpus linguists -- some of them in prominent positions -- who are
quite opposed to the use of these new resources. Some of the "old guard"
actively oppose the use of new resources, which they feel challenge their
existing corpora (see #6 here). But many
others are content to simply ignore these new resources, with the belief that if
they ignore them long enough, they will somehow go away.
The downside of this "head in the sand"
approach, as mentioned, is that the entire field may end up becoming
increasingly irrelevant,
and being ignored by researchers in what could and should be allied fields. For
example, the
CORPORA listserv uses the name "corpora", but the majority of the
posts there are now from the fields of Natural Language Processing and Computational
Linguistics. NLP researchers at CORPORA rarely mention any work from corpus
linguistics. They seem to be blithely unaware of what corpus linguists are doing, and
there is very little interaction between the two fields. (I've been following
the listserv for more than 20 years, and it didn't use to be this way. Once
upon a time, corpus linguists were actually the majority there, and there were actual conversations between the two fields.)
So where to go from here? Just for fun, I
had GPT create two stories about people who were in fields that were at risk of
becoming irrelevant in the face of large, disruptive changes. One story is about
a
carriage maker
in the early 1900s who refuses to adapt in the face of the
"newfangled automobiles", and slips into irrelevance. The other is about the
owner of an Italian restaurant
(scroll down a bit in the link), who is initially reluctant to believe
that the Italian food from "big box" retailers (think LLMs) can in any way compete with the
lovingly, carefully crafted food at his cozy Italian restaurant (think small, beloved 100
million word corpora). But he eventually realizes that he can incorporate key
insights from the "big box" retailers into what he offers at his
restaurant, and he thus has "the best of both worlds".
My hope is that corpus linguists choose to stay
relevant in this "brave new world" of LLMs, and that they begin to use
insights from LLMs -- along with actual corpus data -- in their
teaching and research. Hopefully these white papers will help in that effort.
References
- Davies, Mark (2014) "Making Google Books n-grams useful for a wide range of research on language change". International Journal of Corpus Linguistics 19 (3): 401-16.
- Davies, Mark (2015) "Corpora: An Introduction". In Cambridge Handbook of English Corpus Linguistics, eds. Douglas Biber and Randi Reppen. Cambridge: Cambridge University Press. 11-31.
- Davies, Mark and Don Chapman (2016) "The Effect of Representativeness and Size in Historical Corpora: An Empirical Study of Changes in Lexical Frequency". In Studies in the History of the English Language VII: Generalizing vs. particularizing methodologies in historical linguistic analysis, eds. Don Chapman, Colette Moore, and Miranda Wilcox. Berlin: De Gruyter / Mouton. 131-50.