
English-Corpora.org

Disclaimer for Advanced Users (See disclaimer for "general" users)

A number of analyses on this site -- including collocate grouping, pattern classification, and descriptive summaries -- incorporate the use of large language models (LLMs) such as GPT, Gemini, or Claude. These systems have been integrated not to replace expert analysis, but to offer approximate, accessible overviews grounded in corpus data.

We recognize and address several common concerns:

Intended Audience and Scope: The LLM-generated insights are primarily intended for non-native speakers and language learners, who often lack the tools or training to independently extract patterns from corpus output. Expecting expert-level linguistic analysis -- particularly within the 250–300 word limit imposed by most of the API prompts at this site -- sets an unrealistic benchmark for tools that were not designed for that purpose.
Complementary Function: These outputs function as interpretive scaffolds -- not definitive claims, but preliminary summaries that help users make sense of unfamiliar data. Advanced users are encouraged to treat them as heuristic entry points, not as substitutes for close corpus-based investigation.
On Criticism and Scholarly Standards: It is methodologically unsound to highlight only the subset of LLM outputs that are weak or inaccurate in order to discredit the tool as a whole. A more rigorous evaluation involves selecting a representative or randomized sample and reporting meaningful performance metrics -- e.g., the percentage of responses that are insightful, vague, or misleading.
Constructive Evaluation Welcome: We welcome empirical critiques and suggestions for improvement -- especially when these are grounded in an understanding of corpus methodologies and realistic expectations for LLM output.
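
The evaluation approach described under "On Criticism and Scholarly Standards" can be sketched roughly as follows. This is only an illustration, not a procedure used by this site: the function names (`draw_sample`, `summarize`), the fixed seed, and the assumption that a human rater assigns each sampled output exactly one of the three labels are all hypothetical.

```python
import random
from collections import Counter

# The three rating categories named in the text above.
CATEGORIES = ("insightful", "vague", "misleading")


def draw_sample(output_ids, k, seed=0):
    """Pick k outputs at random (without replacement) for manual rating.

    A fixed seed makes the sample reproducible, so that others can
    check which outputs were rated.
    """
    rng = random.Random(seed)
    return rng.sample(list(output_ids), k)


def summarize(labels):
    """Return the percentage of rated outputs falling in each category."""
    if not labels:
        raise ValueError("no labels to summarize")
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}


if __name__ == "__main__":
    # Hypothetical: 500 LLM outputs identified by number, 4 drawn for rating.
    to_rate = draw_sample(range(500), k=4)
    print("Outputs to rate:", to_rate)

    # Hypothetical ratings assigned by a human reader to those 4 outputs.
    ratings = ["insightful", "insightful", "vague", "misleading"]
    print(summarize(ratings))
    # {'insightful': 50.0, 'vague': 25.0, 'misleading': 25.0}
```

A sample of four is far too small to support any conclusion; it is used here only to keep the arithmetic visible. The point of the sketch is that the outputs to be rated are chosen before anyone looks at their quality, and that all three categories are reported together.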

In short: the LLM-based analyses are useful in context, especially for helping learners engage with linguistic data. For expert users, they are meant to complement (not compete with) traditional corpus analysis, offering efficient approximations that can be further refined through human expertise.