
English-Corpora.org

The following is a history of the different corpora, as well as changes and improvements to the corpus architecture and interface.

2025	Added extensive integration of AI features
2024	Updated and expanded help files
2024	Variable length ("flex") queries
2021-2024	Added several detailed instructional videos: overview, language learning and teaching, word sketches, browsing words, analyze texts, search history, customized word lists, saved words (favorites), KWIC lines: limiting and sorting, saved KWIC lines, analyze KWIC lines, external resources, Virtual Corpora, examining recent change.
2021-2024	Added several detailed PDF help files: overview / guided tour, architecture, association measures, collocates (cf Sketch Engine), topics (and collocates), word sketches, browsing words, analyzing texts, KWIC -> analyze text, saved words and phrases, saving KWIC entries, customized word lists, search history, external resources, monitor corpus, Virtual Corpora, Virtual Corpora: quick overview
2021-2024	Added additional functionality to corpora: ability to see the number of texts (in addition to frequency), case sensitive search, etc. Also, many improvements in terms of speed (although the corpora were already the fastest structured corpora in the world, 5-10 times as fast as other corpora like Sketch Engine).
2021-2024	Added more than 8 billion words of new data for the NOW Corpus
2021. Jun	Integrated the Academic Vocabulary List into COCA
2020. Aug	COCA now allows users to analyze entire texts (e.g. student compositions or online newspapers), and then see detailed data from COCA for the words and phrases in their text.
2020. May	Released the Coronavirus Corpus, which currently contains billion words of data, and which is growing by 80-100 million words each month.
2020. Apr	The frequency-based data from all of the corpora is now linked to a wide range of external resources, including searches of the web, images, videos, books, and translations
2020. Mar	Released new (and final) version of COCA. One billion words; nearly twice as large as before; texts through Dec 2019; new genres (web, blog, TV/movies), and many new word-oriented features (like iWeb)
2019. Mar	Moved the corpora off-campus (from corpus.byu.edu to www.english-corpora.org) in order to ensure their long-term survival. The corpora have the same functionality as before, and in fact they are even a little bit faster at this new website.
2019. Feb	TV corpus: 325 million words in 75,000 very informal episodes (e.g. comedies and dramas) from 1950-2018. Movie corpus: 200 million words in 25,000 movies from 1930-2018. By far the most informal of all of the corpora from English-Corpora.org.
2018. May	14 billion word iWeb ("Intelligent Web") corpus. Unlike other large corpora of English, this one allows much more intelligent website-based searches, as well as in-depth information on the top 60,000 words in the corpus.
2017. Oct	Released the Early English Books Online (EEBO) corpus, which contains 755 million words in more than 25,000 texts from the 1470s to the 1690s.
2017. Sep	All of the corpora and the corpus portal (as well as corpus-based resources) now available with secure HTTPS connection
2017. Feb	Released the US Supreme Court corpus, which contains 130 million words in US Supreme Court opinions during the last 200 years.
2016. May	Released a major update to the corpus interface, which works great on mobile devices and which allows the use of "virtual corpora"
2016. May	Released the NOW corpus, which automatically adds about 180-200 million words of data every month.
2016. May	Released the CORE corpus, which is the first corpus of web pages (about 50 million words of data) that are carefully tagged for register (personal blog, advice, interviews, etc)
2015. Jul	Released the Hansard corpus, which is based on 1.6 billion words in 7.6 million speeches from the British Parliament, 1803-2005.
2015. Jan	Released the Wikipedia corpus, which is based on 1.9 billion words in 4.4 million articles from Wikipedia.
2014. Mar	Released full-text versions of COCA and GloWbE, which allow users to search the downloaded texts on their own computer
2013. Aug	Released www.academicvocabulary.info; free downloadable lists for academic English: word families, core academic, and genre-specific technical words
2013. Aug	Released www.wordandphrase.info/academic: same interface as the WordAndPhrase resources below, but just for COCA-Academic
2013. Apr	Released the Corpus of Global Web-Based English (GloWbE) (1.9 billion words, 2012-13)
2013. Jan	Released the Strathy Corpus (Canadian English) (50 million words, ~1970s-2000s)
2012. Aug	Created ability to compare results from different corpora (side by side) within the web interface, e.g. COCA and BNC
2012. Aug	Updated the British National Corpus with the CLAWS 7 tagset; inclusion of speech indicators, XML World Edition
2012. Jul	Released the Corpus of American Soap Operas (100 million words, 2001-2012)
2012. Jul	Added the following datasets to the Google Books corpora: British English (34 billion words), Fiction (91 billion), One Million Books (89 billion), Spanish (45 billion)
2012. Jun	Added about 25 million words to the Corpus of Contemporary American English (COCA), for Apr 2011 - Jun 2011.
2012. Feb	Modified www.wordandphrase.info: ability to enter entire texts and then see detailed information about words and phrases
2012. Jan	Released www.wordandphrase.info: integrated frequency and genre data, definitions, collocates, concordances, synonyms, and WordNet
2011. Dec	Released free n-gram lists for COCA and COHA; millions of rows of data for 2-grams (two-word sequences), 3-grams, 4-grams, and 5-grams.
2011. May	Released beta version of the Google Books (American English) Corpus (155 billion words, 1810-2009)
2011. Apr	Added about 15 million words to the Corpus of Contemporary American English (COCA), for July 2010 - Mar 2011.
2011. Feb	Added concordance view
2010. Oct	Improved functionality for interaction with other users (see queries, researchers, publications) and ability to save and manipulate Keyword in Context entries.
2010. Sep	Released beta version of the Corpus of Historical American English (COHA)
2010. Aug	Added about 20 million words to the Corpus of Contemporary American English (COCA), for July 2009 - June 2010.
2010. Feb	Released the frequency lists and dictionary that are based on the Corpus of Contemporary American English.
2009. Aug	Added about 15 million words to the Corpus of Contemporary American English (COCA), for October 2008 - June 2009.
2009. May	Added new tools for collaboration: links to previous queries (including annotations/notes) and ability to share them with others
2008. Oct	Added about 15 million words to the Corpus of Contemporary American English (COCA), for Jan-Sep 2008.
2008. Jun	Applied the new architecture to the Corpus do Português
2008. Apr	Applied the new architecture to the British National Corpus and the TIME Corpus
2008. Mar	Released the Corpus of Contemporary American English
2007. Oct	Finished new (current) corpus architecture; applied it to the Corpus del Español. Major updates in this corpus as well, including much-improved tagging and lemmatization for Modern Spanish.
2007. May	Released the TIME Corpus of American English
2006. Aug	Released the Corpus do Português
2005. Apr	Interface for Register Variation in Spanish
2004. Apr	Released VIEW, our first version of the British National Corpus
2002. Sep	Released the first version of the Corpus del Español