The following are some words that appeared much more often in online newspapers and magazines in the United States in 2016 than in 2010-2015. Most of these aren't surprises, which is actually good: it shows that with the right methodology (see below), the NOW corpus provides data that really do reflect what was happening in the US during 2016. Similar data are of course available for the other 19 countries in the NOW corpus.


Politics: president-elect, Donald + Trump, (Mike) Pence, presumptive (j), (James) Comey, Hillary + Clinton, nominee (n), Bernie Sanders, delegate (n), battleground (n), primary (n), candidacy (n), electoral (j), front-runner (n), (Ted) Cruz, electorate (n), rally (n), presidential (j), election, convention, matchup (n), presidency (n), polling (n), nomination (n), convention (n), Marco (Rubio), Democratic, voter (n), campaign (v), ballot (n), election (n), Cleveland, Republican, (New) Hampshire, supporter (n), candidate (n), voting (n), (House) Speaker, strategist (n), campaign (n), vote (n), endorse (v), poll (n), endorsement (n), rally (v), trail (v), vote (v), (Paul) Ryan, (campaign) trail (n), swing (n) (state), race (n), contender (n), elect (v), Democrat, (Mitt) Romney, (Chris) Christie. (Comment: 2016 was an election year in the US, which is undoubtedly related to the high frequency of many of these words, compared to 2010-2015)

Probably related to political issues: referendum (n), deport (v), backer (n), protester (n), (North) Carolina, chant (v), (Hillary Clinton's) server (n)

Probably related in many cases to Donald Trump: alt-right, anti-trump, dislike (v), mogul (n), racist (n), billionaire (n), racism (n), ego (n), racist (j), hate (n), denounce (v). (Comment: there are very few (negative) words that relate almost exclusively to Hillary Clinton in these sources, which is interesting)

Crime / terrorism: Black + Lives + Matter, Isis, Orlando, fatally (r), nightclub (n), homicide (n), firefighter (n), gunfire (n), shooting (n), gunman (n), wound (v), terrorism (n), aggravated (j), gunshot (n), final (n), addict (n), injure (v), suspect (n), shooter (n). (Comment: apparently (and sadly) an especially violent year)

Sports: Rio de Janeiro, innings (n), inning (n), interception (n), touchdown (n), rookie (n), offseason (n), tackle (n), postseason (n), quarterback (n), hitter (n), linebacker (n), receiver (n), opener (n), starter (n), playoff (n), Olympic, defensive (j), rebound (n), streak (n), pitcher (n), Olympics, roster (n). (Comment: it is strange that there are so many sports words, especially since the sources stayed roughly the same from 2010-2015 to 2016 (see #2 below).)

Sports and/or politics: win (n), defeat (v), victory (n), running (j) (running back = sports, running mate = politics)

Others (some interesting words here): Zika (virus), gritty (j), defining (j), avid (j), allergy (n), attribute (n), authenticity (n), tweet (n), rhetoric (n), generic (j), serial (j), depiction (n), proudly (r), vivid (j), tweet (v), (Game of) Thrones, blockbuster (n), contemplate (v), selfie, Instagram, Cuba, militant (n), Ohio, Russian, pledge (v), selfies, passionate (j), Latino, Netflix. (Some of these make sense to us; others not so much)
 


Methodology

(Note: we refer to "words" below, but in many cases we are actually talking about what linguists call "lemmas". The words run, runs, ran, and running all belong to the verbal lemma run. You would expect to find the lemma run in a dictionary, but not ran or runs. In the spreadsheet with the data, most of the entries are for lemmas (the sheet [LemPoS]). Some words don't have an assigned lemma (e.g. selfie), are capitalized (probably proper nouns, e.g. Trump, which we wouldn't want to combine with the verb trump), or are actually tagged for part of speech as proper nouns (e.g. Clinton). These are in the sheet [capsNPNoLem], and they are treated as words, not lemmas.)

1. We first just counted the frequency of all words from 2010-2015 and then compared these to the frequency in 2016. The problem here was that the set of sources (newspapers and magazines) that Google News (the basis for the corpus) links to changes over time. So if in 2016 they started including links from an online magazine X related to Topic Y (which hadn't really been covered well in sources from 2010-2015), then it would look like words related to Topic Y increased in frequency overall.

2. Therefore we created a "sub-corpus" of 30 sources (such as the Washington Post, Chicago Tribune, Salon.com, ESPN.com), in which the share of articles from 2016 was not too different from the share in the corpus as a whole. About 31% of all of the texts for the US in the corpus come from 2016 (273,271 texts for 2016; 869,467 texts for 2010-2016). We chose those sources where the share of texts from 2016 (compared to 2010-2016) was between 50% and 200% of the overall 31% (all sources) figure, i.e. between roughly 16% and 62%. This created a corpus with about 25 million words (51,000 texts) from 2016 and about 95 million words overall (157,000 texts) from 2010-2016.

3. We wanted to ensure that the words weren't limited to just a few of these sources, so for each word we used Juilland's D dispersion measure to see how "equally" the tokens for the word were distributed across all 30 sources, and we eliminated any words with a value below about 0.75.

4. We looked at high frequency phrases with each word, to eliminate words that occurred in highly frequent "boilerplate" phrases in a particular source (e.g. ...this article cannot be redistributed without prior consent...).

5. We then compared the number of texts with the word in the US, both in 2016 and overall in 2010-2016, and ranked the words by the ratio between these two periods (see full data). To be included, a word had to occur in a minimum of about 300 texts in 2016.
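
Steps 2 and 3 can be sketched in Python as follows. This is only an illustration, not the code used for the corpus: the function names are hypothetical, Juilland's D is computed with the common textbook formula D = 1 - V / sqrt(n - 1) (where V is the population standard deviation of the per-source frequencies divided by their mean), and normalizing the counts by source size is an assumption, since the description above does not say which variant was used.

```python
import math
import statistics


def keep_source(texts_2016, texts_all, overall_share=0.31, low=0.5, high=2.0):
    """Step 2: keep a source if its 2016 share of texts lies between
    low and high times the overall share (50% to 200% of about 31%)."""
    share = texts_2016 / texts_all
    return low * overall_share <= share <= high * overall_share


def juilland_d(counts, sizes):
    """Step 3: Juilland's D for one word across n sources.

    counts[i] is the number of tokens of the word in source i and
    sizes[i] is the total number of words in source i. Dividing by
    size (an assumption) keeps large sources from dominating.
    """
    if len(counts) != len(sizes) or len(counts) < 2:
        raise ValueError("need matching counts and sizes for at least two sources")
    rates = [c / s for c, s in zip(counts, sizes)]
    mean = statistics.fmean(rates)
    if mean == 0:
        return 0.0  # the word never occurs, so there is nothing to disperse
    variation = statistics.pstdev(rates) / mean  # coefficient of variation
    return 1 - variation / math.sqrt(len(rates) - 1)


# Hypothetical numbers, for illustration only.
print(keep_source(texts_2016=300, texts_all=1000))   # True: 30% is within 15.5%-62%
print(juilland_d([12, 9, 10], [1000, 1000, 1000]))   # close to 1: evenly spread
print(juilland_d([30, 0, 0], [1000, 1000, 1000]))    # close to 0: one source only
```

A word would pass step 3 in this sketch if `juilland_d(...)` returned at least about 0.75.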

To take a concrete example (from the full data), nominee occurs in 2,720 texts from 2016 and 3,555 texts from 2010-2016, which means that 77% of all of the texts containing nominee are from 2016. As #2 above shows, about 31% of all texts in the US corpus (in our sub-corpus of 30 sources) are from 2016, and therefore for any word we would expect about 31% of the texts to be from 2016. So 0.77 is 2.43 times the expected figure of 0.31 (the ratio is computed from the unrounded values, about 0.765 and 0.314), and that number (2.43) is what is used to find "more frequent" words from 2016. The words are ranked by this ratio (for each of the categories above).
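
The arithmetic for nominee, and the ranking in step 5, can be reproduced with a short Python sketch. The function names are hypothetical, and the expected share is taken here from the text counts given in #2 (273,271 of 869,467), which is an assumption that happens to reproduce the 2.43 figure. Apart from nominee, the counts in the example dictionary are invented for illustration.

```python
def overrepresentation(texts_2016, texts_all, expected_share):
    """Observed 2016 share of a word's texts divided by the expected share."""
    observed_share = texts_2016 / texts_all
    return observed_share / expected_share


def rank_words(text_counts, expected_share, min_texts_2016=300):
    """Rank words by over-representation in 2016, dropping words that
    occur in fewer than min_texts_2016 texts from 2016 (step 5)."""
    rows = [
        (word, overrepresentation(n_2016, n_all, expected_share))
        for word, (n_2016, n_all) in text_counts.items()
        if n_2016 >= min_texts_2016
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Expected share: texts from 2016 divided by texts from 2010-2016 (see #2).
EXPECTED_SHARE = 273_271 / 869_467

print(round(overrepresentation(2_720, 3_555, EXPECTED_SHARE), 2))  # 2.43

# word -> (texts in 2016, texts in 2010-2016)
counts = {
    "nominee": (2_720, 3_555),      # from the full data
    "example_a": (400, 1_300),      # hypothetical
    "example_b": (120, 150),        # hypothetical; below the 300-text minimum
}
for word, ratio in rank_words(counts, EXPECTED_SHARE):
    print(word, round(ratio, 2))
```

A ratio near 1.0 means a word appears in 2016 about as often as expected; the words listed in the categories above are those with the highest ratios.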