Mark Davies / March 2025

1. Data and theory
I am a retired professor of (Corpus) Linguistics, not a computer
			scientist. While I am very interested in AI and LLMs at an "amateur" 
			level, I know relatively little about the theory underlying 
			machine learning, deep learning, neural networks, and so on. I have 
			a close family member who is close to receiving a PhD in this field, 
			and even though he has tried his best to explain some of these concepts 
to me, I must confess that my now 60+-year-old brain has probably
			only internalized some of these concepts, and so any errors in 
			analysis are mine alone. While there might be (even very basic) 
			misunderstandings about the theory underlying LLMs in these 
			white papers, my hope is that the data that I have presented 
			here will be of benefit to those who are involved in trying 
			to interpret better how LLMs work.
2. These tests are simply a “first attempt”
The “tests” in the various white papers are simply a first
attempt at seeing how the predictions of LLMs compare to actual
			attempt at seeing how the predictions of LLMs compare to actual 
			corpus data. Undoubtedly, some of these tests will seem strange to 
			some researchers, who will wonder “Well, who would have thought that 
			an LLM could do that anyway? Why even ask it to?” But that’s kind of 
			the point. There are still many (even very basic) questions that are 
			unanswered about what LLMs can do in terms of linguistic analysis 
			and what they know about language. They were able to do some things 
that I had thought would be quite impossible (to paraphrase the
great 18th-century lexicographer Samuel Johnson), and so it
doesn’t hurt to at least ask.
3. Other LLMs (DeepSeek, etc.)
These papers only use ChatGPT-4o (from OpenAI) and Gemini (1.5 Pro)
			from Google. There are many other great LLMs out there, from 
			Anthropic, Meta, DeepSeek, Perplexity, and more. I would very much 
			like to have carried out these same searches in DeepSeek, especially 
			since it is freely available to researchers (along with other LLMs 
like Llama, Qwen, Mistral, etc). But DeepSeek was released
			towards the end of this research, and it just wasn’t possible to 
			start all over. In addition, many of the tables and charts in the 
			white papers barely fit on the page, even with just two LLMs. I will 
			leave it to other researchers to carry out similar queries with 
other LLMs and would be happy to host (or link to) the results at
			English-Corpora.org.
4. Creating benchmarks
Eventually, it would be nice to have some good corpus-informed
			“benchmarks” that could be used to test LLM performance in terms of 
			language knowledge (word and phrase frequency, collocates, language 
			variation, etc). These already exist for other domains, such as 
			questions from tests like the LSAT (law schools), GMAT (business 
			schools), MCAT (medical schools), etc. Maybe some of the tests in 
these white papers will give others some ideas on what such
linguistically-oriented benchmarks might look like.
5. This data in future LLMs
Now that these white papers are online, it will only be a matter
of time before they are swallowed up by a web scraper and then
become part of a future LLM. So 12 or 18 months from now, when you
			do one of the searches described in these pages, the LLM may give an 
			answer that looks suspiciously like the corpus results. That’s why 
			I’ve included links to the conversations from GPT and Gemini, so 
			that in the future people can see how the LLMs actually responded 
			back when these papers were being written.
 
6. Why "white papers"?
I have had what I feel is a
productive career, and I have
published widely and 
presented at many conferences (including many as keynote speaker). Why present 
this data in a "white paper" format, rather than as a journal article or a 
conference presentation? As for a conference presentation, I recently
presented this as a keynote speaker at conferences in
Spain in May 2025 and
Germany in July 2025. And as for a book chapter or a journal article (that's how we do things
in corpus linguistics), I think that there is still a lot of opposition from the 
"old guard" in the field to new 
resources, especially to something like LLMs. There have been five or six times in 
my career where the old guard put heavy pressure on book or journal editors to reject 
papers that I'd submitted, simply because they thought that the new
corpora that I'd created somehow 
"challenged" their existing corpora. (Yeah, I know that sounds paranoid, but it 
really has happened. And fortunately, these papers did end up getting
published. So thank you, 
editors :-) But I'm just not in the mood for any more battles like that, especially 
now that I'm retired. So at least for the time being, this data will be in white 
papers, where I can present the data as I see it, without having to (once again)
go to battle against the corpus linguistics "old guard".