Why not just use AI instead of corpora?
Mark Davies (2026)
Why not just use AI (via LLMs: Large Language
Models) to get language data, instead of a well-designed corpus? For example,
why not just ask ChatGPT or Gemini or Claude to generate the following
linguistic data for English:
- (word frequency) find the most frequent words that match certain patterns, such as *break* (Section #5 here)
- (phrase frequency) find nouns that occur the most after the word dark (#5 here)
- (genres) find words that are more common in fiction than academic, or vice versa (Section #3 here)
- (historical) find adjectives that were more common in the 1800s than in the 1980s-2010s (Section #6 here)
This would be much easier than learning to use a corpus. With a corpus you need to learn the search syntax -- for example (for #2 above) dark NOUN at English-Corpora.org (click that link to see the results from COCA), or something much more complicated like [word="dark"] [tag="N.*"] at Sketch Engine or CQPWeb. But with an LLM, you could simply ask "What nouns occur the most right after dark?"
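To make the corpus side of this concrete, here is a minimal sketch (purely illustrative, not part of English-Corpora.org) of the kind of counting that a query like dark NOUN performs, written in Python against NLTK's tagged Brown corpus:

    # Illustration only: count the nouns that follow "dark" in NLTK's tagged
    # Brown corpus -- roughly what the query "dark NOUN" asks a corpus to do.
    from collections import Counter
    import nltk

    nltk.download("brown", quiet=True)        # a ~1-million-word tagged corpus
    from nltk.corpus import brown

    tagged = list(brown.tagged_words())       # (word, POS-tag) pairs
    counts = Counter(
        w2.lower()
        for (w1, _), (w2, t2) in zip(tagged, tagged[1:])
        if w1.lower() == "dark" and t2.startswith("NN")
    )
    print(counts.most_common(10))             # most frequent nouns after "dark"

At English-Corpora.org this is a single search string run against a much larger, carefully balanced corpus; the point is simply that the answer comes from counting attested text, not from a model's guess.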
Unfortunately, LLMs are not good at generating linguistic data. For example, look at a comparison of corpus data and LLM predictions for each of the four preceding questions by clicking on the link at the end of each line (and then going to the indicated section on that page). The predictions from the LLMs don't match up well with actual corpus data. So unfortunately, no, we can't just ask a question like "what nouns occur the most right after dark" and expect the LLMs' predictions to represent actual language very well.
While LLMs are bad at generating
linguistic data, they are better at analyzing linguistic data.
For example, they would be fairly good at the following (click on the link and go to the indicated section to see examples):
- (word frequency) given a list of words matching a certain pattern, rank them by frequency (Section #2 here)
- (phrase frequency) given a list of nouns, decide which occur the most after the word dark (Section #2 here)
- (genres) given a list of words, figure out in which genre they are the most common (Section #1 here)
- (historical) given a particular syntactic construction, decide whether it is increasing or decreasing in frequency over time (Section #7 here)
- (dialects) given a list of words, decide in which of 5-10 countries they would be the most frequent (Section #2 here)
Their predictions about relative frequency and distribution definitely wouldn't be as good as actual corpus data. But LLMs can analyze and explain the corpus data quite well -- for example, why certain words in a corpus are more frequent in academic texts, or why certain syntactic constructions have increased over time. The crucial point, though, is that all of these tasks require that we first give the LLM some linguistic data (such as corpus data); only then can the LLM do a good job analyzing and explaining the data.
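As a rough sketch of that division of labor (an assumption about the general workflow, not a description of how English-Corpora.org is implemented), one might take a frequency table produced by a corpus and only then hand it to an LLM for explanation -- here using the OpenAI Python client, with placeholder counts and an illustrative model name:

    # Illustration only: give the LLM real corpus output to explain, rather
    # than asking it to invent the frequencies itself.
    from openai import OpenAI

    client = OpenAI()                              # assumes OPENAI_API_KEY is set

    # Placeholder counts standing in for a real corpus frequency table.
    noun_counts = {"side": 999, "matter": 888, "eyes": 777, "hair": 666}
    table = "\n".join(f"dark {noun}\t{n}" for noun, n in noun_counts.items())

    prompt = (
        "This corpus frequency table shows nouns that occur right after 'dark'. "
        "Explain why these particular nouns collocate with 'dark':\n" + table
    )

    reply = client.chat.completions.create(
        model="gpt-4o-mini",                       # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.choices[0].message.content)

The essential point is the order of operations: the frequencies come from the corpus, and the LLM only interprets them.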
This is why at English-Corpora.org, we don't use LLMs to generate linguistic data (except in certain cases, where the output is carefully verified against corpus data). But we do use LLMs to analyze and explain the data that has been generated by the corpora.
In summary, language learners, teachers, and
researchers who want reliable data should not try to use LLMs (AI) to
generate lists -- such as words with *break*, or the most
frequent phrases for dark NOUN or NOUN market. But they can
use AI (LLMs) to analyze linguistic data, such as the data that
they get from a corpus.
Crucially, though, this assumes that they are using
the right corpora. And currently, this AI functionality (including the ability to see the analyses in their own language) is available only at English-Corpora.org.