English-Corpora.org

CORPORA AND AI / LLMs: Notes and disclaimers

Mark Davies / March 2025

1. Data and theory
I am a retired professor of (Corpus) Linguistics, not a computer scientist. While I am very interested in AI and LLMs at an "amateur" level, I know relatively little about the theory underlying machine learning, deep learning, neural networks, and so on. I have a close family member who is nearing a PhD in this field, and even though he has tried his best to explain some of these concepts to me, I must confess that my now-60+-year-old brain has probably internalized only some of them, and so any errors in analysis are mine alone. While there might be (even very basic) misunderstandings about the theory underlying LLMs in these white papers, my hope is that the data that I have presented here will be of benefit to those who are trying to better understand how LLMs work.

2. These tests are simply a “first attempt”
The “tests” in the various white papers are simply a first attempt at seeing how the predictions of LLMs compare to actual corpus data. Undoubtedly, some of these tests will seem strange to some researchers, who will wonder “Well, who would have thought that an LLM could do that anyway? Why even ask it to?” But that’s kind of the point. There are still many (even very basic) unanswered questions about what LLMs can do in terms of linguistic analysis and what they know about language. They were able to do some things that I had thought would be quite impossible (to paraphrase the great 18th-century lexicographer Samuel Johnson), and so it doesn’t hurt to at least ask.

3. Other LLMs (DeepSeek, etc.)
These papers only use ChatGPT-4o (from OpenAI) and Gemini 1.5 Pro (from Google). There are many other great LLMs out there, from Anthropic, Meta, DeepSeek, Perplexity, and more. I would very much have liked to carry out these same searches in DeepSeek, especially since it is freely available to researchers (along with other LLMs like Llama, Qwen, Mistral, etc.). But DeepSeek was released towards the end of this research, and it just wasn’t possible to start all over. In addition, many of the tables and charts in the white papers barely fit on the page, even with just two LLMs. I will leave it to other researchers to carry out similar queries with other LLMs, and I would be happy to host (or at least link to) the results at English-Corpora.org.

4. Creating benchmarks
Eventually, it would be nice to have some good corpus-informed “benchmarks” that could be used to test LLM performance in terms of language knowledge (word and phrase frequency, collocates, language variation, etc.). These already exist for other domains, such as questions from tests like the LSAT (law schools), the GMAT (business schools), and the MCAT (medical schools). Maybe some of the tests in these white papers will give others some ideas about what such linguistically-oriented benchmarks might look like, along the lines of the sketch below.
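Purely as an illustration (this sketch is mine, not from the white papers; the node word, collocate lists, and scores below are all invented), one item in such a benchmark might pair a node word with its corpus-derived collocates and then score an LLM's answer against that gold list:

```python
# Hypothetical sketch of a corpus-informed benchmark item. All words,
# collocate lists, and scores are invented for illustration; nothing
# here is taken from the white papers or from any actual corpus.

from dataclasses import dataclass


@dataclass
class CollocateItem:
    """One benchmark item: name the top collocates of a node word."""
    node: str              # the word being tested
    gold_collocates: list  # corpus-derived collocates, best first


def overlap_at_k(gold, predicted, k=10):
    """Fraction of the top-k corpus collocates that the LLM also listed."""
    gold_top = set(gold[:k])
    pred_top = set(predicted[:k])
    return len(gold_top & pred_top) / min(k, len(gold_top))


if __name__ == "__main__":
    item = CollocateItem(
        node="utter",
        gold_collocates=["nonsense", "failure", "disbelief", "contempt",
                         "darkness", "silence", "destruction", "chaos",
                         "despair", "confusion"],  # invented "gold" list
    )
    # Imagine this list came back from prompting an LLM with the same task.
    llm_answer = ["nonsense", "silence", "chaos", "words", "failure",
                  "darkness", "truth", "despair", "joy", "confusion"]
    print(f"{item.node}: overlap@10 = "
          f"{overlap_at_k(item.gold_collocates, llm_answer):.2f}")
```

A real benchmark would of course draw its gold lists from actual corpus data (e.g., collocate rankings from a corpus like COCA) and would probably use more robust metrics, such as rank correlation or frequency-weighted overlap, rather than the simple set overlap shown here.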

5. This data in future LLMs
Now that these white papers are online, it will only be a matter of time before they are swallowed up by a web scraper and become part of a future LLM. So 12 or 18 months from now, when you run one of the searches described in these pages, the LLM may give an answer that looks suspiciously like the corpus results. That’s why I’ve included links to the conversations with GPT and Gemini, so that in the future people can see how the LLMs actually responded back when these papers were being written.

6. Why "white papers"?
I have had what I feel is a productive career, and I have published widely and presented at many conferences (including many as a keynote speaker). So why present this data in a "white paper" format, rather than as a journal article or a conference presentation? As for conference presentations, I will be presenting this material as a keynote speaker at conferences in Spain in May 2025 and in Germany in July 2025. And as for a book chapter or a journal article (that's how we do things in corpus linguistics), I think that there is still a lot of opposition from the "old guard" in the field to new resources, especially to something like LLMs. There have been five or six times in my career when the old guard put heavy pressure on book or journal editors to reject papers that I'd submitted, simply because they thought that the new corpora I'd created somehow "challenged" their existing corpora. (Yeah, I know that sounds paranoid, but it really has happened. And fortunately, those papers did end up getting published. Thank you, editors!) :-) But I'm just not in the mood for any more battles like that, especially now that I'm retired. So at least for the time being, this data will stay in white papers, where I can present it as I see it, without having to (again) go "toe to toe" with the corpus linguistics "old guard".