English-Corpora.org

Why not just use AI instead of corpora?  

Mark Davies (2026)

Why not just use AI (via LLMs: Large Language Models) to get language data, instead of a well-designed corpus? For example, why not just ask ChatGPT or Gemini or Claude to generate the following linguistic data for English:

  1. (word frequency) find the most frequent words that match certain patterns, such as *break* (Section #5 here; see the sketch after this list)

  2. (phrase frequency) find nouns that occur the most after the word dark (Section #5 here)

  3. (genres) find words that are more common in fiction than academic, or vice versa (Section #3 here)

  4. (historical) find adjectives that were more common in the 1800s than in the 1980s-2010s (Section #6 here)
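
As a rough illustration of what a wildcard query like *break* (item #1 above) actually computes, here is a minimal sketch in Python, using the small, freely available Brown corpus from NLTK rather than the much larger corpora at English-Corpora.org (so the counts will differ):

    from collections import Counter

    import nltk
    from nltk.corpus import brown

    # The Brown corpus ships with NLTK; download it on first use.
    nltk.download("brown", quiet=True)

    # Count every word form containing "break" -- roughly what the
    # wildcard query *break* asks a corpus for.
    counts = Counter(w.lower() for w in brown.words() if "break" in w.lower())

    for word, freq in counts.most_common(10):
        print(f"{word}\t{freq}")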

Asking an LLM would be much easier than learning to use a corpus. With a corpus you need to learn the search syntax -- for example (for #2 above) dark NOUN at English-Corpora.org (click that link to see the results from COCA), or something much more complicated like [word="dark"] [tag="N.*"] at Sketch Engine or CQPWeb. But with an LLM, you could simply ask, "What nouns occur the most right after dark?"
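
Whichever syntax is used, the query computes the same thing. As a sketch, here is the dark NOUN search expressed in Python over NLTK's tagged Brown corpus (again a small stand-in for the corpora at English-Corpora.org, not how the site itself works):

    from collections import Counter

    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)

    # Count the nouns that occur immediately after "dark". In the
    # Brown tagset, common-noun tags all begin with "NN".
    counts = Counter()
    for sent in brown.tagged_sents():
        for (w1, _), (w2, tag2) in zip(sent, sent[1:]):
            if w1.lower() == "dark" and tag2.startswith("NN"):
                counts[w2.lower()] += 1

    for noun, freq in counts.most_common(10):
        print(f"{noun}\t{freq}")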

Unfortunately, LLMs are not good at generating linguistic data. For example, compare the corpus data with the LLM predictions for each of the four questions above by clicking on the link at the end of each line (and then going to the indicated section on that page). The predictions from the LLMs don't match up well with actual corpus data. So unfortunately, no, we can't just ask a question like "What nouns occur the most right after dark?" and expect the predictions of LLMs to represent actual language very well.


While LLMs are bad at generating linguistic data, they are better at analyzing linguistic data. For example, they would be fairly good at the following (click on the link and go to the indicated section to see examples):

  1. (word frequency) given a list of words matching a certain pattern, rank them by frequency (Section #2 here)

  2. (phrase frequency) given a list of nouns, decide which occur the most after the word dark (Section #2 here)

  3. (genres) given a list of words, figure out in which genre they are the most common (Section #1 here)

  4. (historical) given a particular syntactic construction, decide whether it is increasing or decreasing in frequency over time (Section #7 here)

  5. (dialects) given a list of words, decide in which of 5-10 countries they would be the most frequent (Section #2 here)

Their predictions about relative frequency and distribution definitely wouldn't be as good as actual corpus data. But LLMs can analyze and explain corpus data quite well -- for example, why certain words in a corpus are more frequent in academic texts, or why certain syntactic constructions have increased over time. Crucially, though, all of these tasks require that we first give the LLM some linguistic data (such as corpus data); only then can the LLM do a good job of analyzing and explaining that data.
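
As a sketch of this "corpus data first, LLM second" workflow, here is how corpus-derived frequencies might be handed to an LLM for explanation, using the OpenAI Python client. The model name and the frequency figures are placeholders for illustration, not real COCA data, and this is not how English-Corpora.org itself is implemented:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Per-million-word frequencies in two genres. These numbers are
    # invented for illustration; real figures would come from a corpus.
    corpus_data = (
        "phrase        fiction  academic\n"
        "dark eyes        41.2       0.3\n"
        "dark side        12.5       4.1\n"
        "dark matter       0.4      18.9\n"
    )

    # The LLM is not asked to generate frequencies, only to explain a
    # distribution that the corpus has already measured.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Here are per-million-word frequencies from a corpus:\n\n"
                + corpus_data
                + "\nExplain why these phrases pattern differently "
                "across the two genres."
            ),
        }],
    )
    print(response.choices[0].message.content)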

This is why at English-Corpora.org, we don't use LLMs to generate linguistic data (except in certain cases, where their output is carefully verified against corpus data). But we do use LLMs to analyze and explain the data that has been generated by the corpora.


In summary, language learners, teachers, and researchers who want reliable data should not try to use LLMs (AI) to generate lists -- such as words with *break*, or the most frequent phrases for dark NOUN or NOUN market. But they can use AI (LLMs) to analyze linguistic data, such as the data that they get from a corpus.

Crucially, though, this assumes that they are using the right corpora. And currently, this AI functionality (including the ability to see the analyses in their own language) is only possible at English-Corpora.org.