Mark Davies / May 2025

English-Corpora.org will soon offer
something entirely new: the ability to combine the depth and reliability
of corpus data with the analytic power of Large Language Models (LLMs)
like GPT, Gemini, Claude, Grok, Llama, and DeepSeek.
With just one click, the corpus will send
collocates, frequency patterns, phrase lists, or concordance lines to an LLM — which will instantly group, explain, and interpret the data. These
AI-powered insights appear directly in the interface, alongside the
original corpus results.
The result? Faster understanding of
patterns, clearer semantic groupings, and deeper insight into how
language works — for language learners and researchers alike. The corpus
data remains front and center, but now with the option of an intelligent
assistant working behind the scenes to guide your analysis.
Corpora first. AI second. Insights
faster.
Background: In March 2025 we released seven detailed studies (89 pages
of discussion and examples in total) on how well the
predictions of two LLMs -- GPT-4o (from OpenAI) and Gemini 1.5 Pro
(from Google) -- match the actual data from large, well-known,
publicly accessible corpora (like those from English-Corpora.org). The seven
studies are found here:
word frequency,
phrase frequency,
collocates,
comparing words (via collocates),
genre-based variation,
historical variation, and dialectal
variation. You can also see a summary of
these findings, as well as a
video on YouTube.
Soon after we finished this comparison,
however, we thought -- maybe it's not a question of "either/or".
Maybe it's an issue of "and/with". Why not take the strengths of AI /
LLMs, and integrate them right into the corpus interface? As the
comparison between corpora and AI/LLMs
indicates, what LLMs are really good at is classifying and
explaining data. And that's why we're integrating that insight into
English-Corpora.org.
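As a rough illustration of what handing corpus data to an LLM for classification might involve, the sketch below formats a list of collocates as a plain-text prompt. It is not the code used by English-Corpora.org: the function name `build_collocate_prompt`, the prompt wording, and the frequencies are all assumptions made up for this example, and the step of actually sending the prompt to a model is left out.

```python
from typing import Sequence


def build_collocate_prompt(
    node_word: str,
    collocates: Sequence[tuple[str, int]],
    max_items: int = 50,
) -> str:
    """Format (collocate, frequency) pairs as a prompt asking an LLM to group them.

    Hypothetical helper for illustration; not part of any corpus interface.
    """
    # Put the most frequent collocates first and cap the list length.
    ranked = sorted(collocates, key=lambda pair: pair[1], reverse=True)[:max_items]
    lines = [f"{word}\t{freq}" for word, freq in ranked]
    return (
        f"The following are collocates of the word '{node_word}' from a corpus, "
        "with their frequencies.\n"
        "Group them into semantic categories, name each category, and briefly "
        "explain what each group shows about the senses of the word.\n\n"
        + "\n".join(lines)
    )


# Made-up frequencies, for illustration only (not real corpus counts).
sample = [("knee", 15), ("baseball", 120), ("bottle", 60), ("salary", 95)]
prompt = build_collocate_prompt("cap", sample)
print(prompt)
```

The corpus supplies the data and the LLM is asked only to group and explain it, which is the division of labor described above.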
The following are some of the ways that we're
integrating AI/LLM insights with corpus data. Pay close attention to the
categorization and
especially the analysis from the LLMs. None of the analyses that you see on any
of these pages are human-generated; they all come from the LLMs. We think you'll
agree that this insight from LLMs will completely transform the way that people
interact with the corpora, especially for non-native speakers and language
learners.
(Of course, while the AI analyses offer powerful
insights, users should remember that they represent intelligent suggestions
based on patterns in the data, and they are not 100% accurate linguistic
conclusions. If you demand absolute perfection, then LLMs might not be for you.)
Note that we already have this functionality
working in the corpora; we will release it for "beta testing" in June
(and you can volunteer to help with that),
and then for public release in July 2025.
Task / function | Examples / discussion (note: there are more examples in the video) | Video | PDF
Introduction: Integrating LLM insights and corpus data | The mechanics of integrating LLM insights into corpus data | |
Classifying and categorizing collocates and related phrases | Collocates of cap, collocates of bow (COCA, iWeb), soft NOUN | |
Comparing two words (via collocates) | Quandary vs predicament, provoke vs incite, fleeting vs ephemeral | |
Comparing two genres, time periods, and dialects (lists) | chain + NOUN (fic / acad), ADJ women (1800s / now), ADJ scheme (US / UK) | |
Comparing genres, time periods, and dialects (all sections) | soft NOUN (genres), ADJ food (historical), *ism (dialects) | |
Comparing genres, time periods, and dialects (charts) | "like construction" (genres), need NEG (historical), soft day (dialects) | |
Analyzing KWIC/concordance lines | fathom, naked eye (collocations, prosody, syntactic, pragmatic, domains) | |
Generating words and phrases for topics and concepts | astronomy, COVID-19, female jobs in 1800s, "I'm tired", Sp manguera, Ger es tut mir leid | |
Conclusion | | |