Mark Davies / March 2025
1. Data and theory
I am a retired professor of (Corpus) Linguistics, not a computer
scientist. While I am very interested in AI and LLMs at an "amateur"
level, I know relatively little about the theory underlying
machine learning, deep learning, neural networks, and so on. I have
a close family member who is close to receiving a PhD in this field,
and even though he has tried his best to explain some of these concepts
to me, I must confess that my now-60+ year old brain has probably
only internalized some of these concepts, and so any errors in
analysis are mine alone. While there might be (even very basic)
misunderstandings about the theory underlying LLMs in these
white papers, my hope is that the data that I have presented
here will be of benefit to those who are involved in trying
to interpret better how LLMs work.
2. These tests are simply a “first attempt”
The “tests” in the various white papers are simply a first
attempt at seeing how the predictions of LLMs compare to actual
corpus data. Undoubtedly, some of these tests will seem strange to
some researchers, who will wonder “Well, who would have thought that
an LLM could do that anyway? Why even ask it to?” But that’s kind of
the point. There are still many (even very basic) questions that are
unanswered about what LLMs can do in terms of linguistic analysis
and what they know about language. They were able to do some things
that I had thought would be quite impossible (to
paraphrase the great 18th century lexicographer
Samuel Johnson), and so it doesn’t hurt
to at least ask.
3. Other LLMs (DeepSeek, etc.)
These papers only use ChatGPT-4o (from OpenAI) and Gemini 1.5 Pro
(from Google). There are many other great LLMs out there, from
Anthropic, Meta, DeepSeek, Perplexity, and more. I would very much
like to have carried out these same searches in DeepSeek, especially
since it is freely available to researchers (along with other LLMs
like Llama, Qwen, Mistral, etc). But DeepSeek was released
towards the end of this research, and it just wasn’t possible to
start all over. In addition, many of the tables and charts in the
white papers barely fit on the page, even with just two LLMs. I will
leave it to other researchers to carry out similar queries with
other LLMs, and I would be happy to host (or at least link to) the results at
English-Corpora.org.
4. Creating benchmarks
Eventually, it would be nice to have some good corpus-informed
“benchmarks” that could be used to test LLM performance in terms of
language knowledge (word and phrase frequency, collocates, language
variation, etc). These already exist for other domains, such as
questions from tests like the LSAT (law schools), GMAT (business
schools), MCAT (medical schools), etc. Maybe some of the tests in
these white papers will give others some ideas on what such
linguistically oriented benchmarks might look like.
5. This data in future LLMs
Now that these white papers are online, it will only be a matter
of time before they are swallowed up by a web scraper and become
part of a future LLM. So 12 or 18 months from now, when you
do one of the searches described in these pages, the LLM may give an
answer that looks suspiciously like the corpus results. That’s why
I’ve included links to the conversations from GPT and Gemini, so
that in the future people can see how the LLMs actually responded
back when these papers were being written.
6. Why "white papers"?
I have had what I feel is a
productive career, and I have
published widely and
presented at many conferences (including many as keynote speaker). Why present
this data in a "white paper" format, rather than as a journal article or a
conference presentation? As for conference presentations, I will be
presenting this as a keynote speaker at conferences in
Spain in May 2025 and
Germany in July 2025. And as for a book chapter or a journal article (that's how we do things
in corpus linguistics), I think that there is still a lot of opposition from the
"old guard" in the field to new
resources, especially to something like LLMs. There have been five or six times in
my career where the old guard put heavy pressure on book or journal editors to reject
papers that I'd submitted, simply because they thought that the new
corpora that I'd created somehow
"challenged" their existing corpora. (Yeah, I know that sounds paranoid, but it
really has happened. And fortunately, these papers did end up getting
published. Thank you,
editors :-) But I'm just not in the mood for any more battles like that, especially
now that I'm retired. So at least for the time being, this data will be in white
papers, where I can present the data as I see it, without having to (again) go
"toe to toe" with the corpus linguistics "old guard".