Mark Davies / September 2025
English-Corpora.org now offers something
entirely new: the ability to combine the depth and reliability of corpus
data with the analytic power of Large Language Models (LLMs) like GPT,
Gemini, Claude, Perplexity, Llama, Mistral, and DeepSeek.
With just one click, the corpus can send
collocates, frequency patterns, phrase lists, or concordance lines to an
LLM — which will instantly group, explain, and interpret the data. These
AI-powered insights appear directly in the interface, alongside the
original corpus results.
The result? Faster understanding of
patterns, clearer semantic groupings, and deeper insight into how
language works — for language learners and researchers alike. The corpus
data remains front and center, but now with the option of an intelligent
assistant working behind the scenes to guide your analysis.
Corpora first. AI second. Insights faster.
Background: In March 2025 I released seven detailed studies (with a total of 89
pages of discussion and hundreds of examples) examining how well the
predictions of two LLMs -- GPT-4o (from OpenAI) and Gemini 1.5 Pro (from Google)
-- match the actual data from large, well-known, publicly accessible corpora
(like those from English-Corpora.org). The seven detailed studies are found
here:
word frequency,
phrase frequency,
collocates,
comparing words (via collocates),
genre-based variation,
historical variation, and dialectal
variation. You can also see a summary of
these findings, as well as a
video on YouTube.
Soon after I finished this comparison,
however, I thought -- maybe it's not a question of "either/or".
Maybe it's an issue of "and/with". Why not take the strengths of AI /
LLMs, and integrate them right into the corpus interface? As the
comparison between corpora and AI/LLMs
indicates, what LLMs are really good at is classifying and
explaining data. And that's why these LLM insights are being directly
integrated into English-Corpora.org.
The following are some of the ways that the AI/LLM
insights have been integrated with corpus data. Pay close attention to the
categorization and especially the analysis from the LLMs. None of the analyses
that you see on any of these pages are human-generated; they all come from the
LLMs. I think you'll agree that this insight from LLMs will completely transform
the way that people interact with the corpora, especially for non-native
speakers and language learners.
(Of course, while the AI analyses offer powerful
insights, users should remember that they represent intelligent suggestions
based on patterns in the data, and at times they might not be 100% accurate or
even very insightful. If you demand absolute perfection and accuracy, then LLMs might not
be for you.)
Task / function | Examples / discussion | Video | PDF |
Introduction: Integrating LLM insights and corpus data | The mechanics of integrating LLM insights into corpus data | | |
Classifying and categorizing collocates | Collocates of cap | | |
Classifying and categorizing collocates (COCA, iWeb) | Collocates of bow | | |
Classifying and categorizing phrases | soft NOUN | | |
Comparing two words (via collocates) | Quandary vs predicament, provoke vs incite | | |
Comparing two genres, time periods, and dialects (lists) | chain + NOUN (fic / acad), ADJ women (1800s / now), ADJ scheme (US / UK) | | |
Comparing genres, time periods, and dialects (all sections) | soft NOUN (genres), ADJ food (historical), *ism (dialects) | | |
Comparing genres, time periods, and dialects (charts) | "like construction" (genres), need NEG (historical), soft day (dialects) | | |
Analyzing KWIC/concordance lines | fathom, naked eye (collocations, prosody, syntactic, pragmatic, domains) | | |
Generating words and phrases for topics and concepts | climate change, famous actresses, female jobs in 1800s, German es tut mir leid | | |
Generating words and phrases via translations | German sowohl alt als jung, Russian финансовое состояние, Korean 중요한 사안 | | |
Generating words and phrases: find "more natural" phrases | make a photo (perhaps from Japanese 写真を撮る), pleasing scenery, tough idea | | |
Conclusion | | | |
Purchase credits | Purchase credits; see detailed balance sheet; use own API key (if desired) | | |
Select LLMs | Re-do any query immediately in another LLM; specify default LLMs | | |
Choose display language | Choose to display results (and classifications) in any one of 20 different languages | | |
Create AI profile | Choose from among 14 different "profiles" in order to get more targeted results | | |
Save and retrieve results | Save any results (analysis or classification); search through and re-do past queries | | |