No, Online Grammar Errors Have Not Increased by 148%

Yesterday a post appeared on QuickandDirtyTips.com (home of Grammar Girl’s popular podcast) that appears to have been written by a company called Knowingly, which is promoting its Correctica grammar-checking tool. The post claims that “online grammar errors have increased by 148% in nine years”. That would be a pretty shocking finding if true, but the numbers immediately sent up some red flags.

They searched for seventeen different errors and compared the numbers from nine years ago to the numbers from today. From the description, I gather that the first set of numbers comes from a publicly available set of data that Google culled from public web pages. The data was released in 2006 and is hosted by the Linguistic Data Consortium. You can read more about the data here, but this part is the most relevant:

We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

So the data is taken from over a trillion words of text, but some sequences were discarded if they didn’t appear frequently enough, and you can only search sequences up to five words long. Also note that while the data was released in 2006, it does not necessarily all come from 2006; some of it could have come from web pages that were older than that.

It sounds like the second set of numbers comes from a series of Google searches—it simply says “search result data today”. It isn’t explicitly stated, but it appears that the search terms were put in quotes to find exact strings. But we’re already comparing apples and oranges: though the first set of data came from a known sample size (just over a trillion words) and was cleaned up a bit by having outliers thrown out, we have no idea how big the second sample size is. How many words are you effectively searching when you do a search in Google?

This is why corpora usually present not just raw numbers but normalized numbers—that is, not just an overall count, but a count per thousand words or something similar. Knowing that you have 500 instances of something in data set A and 1000 instances in data set B doesn’t mean anything unless you know how big those sets are, and in this case we don’t.
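To illustrate (with hypothetical corpus sizes, purely for the sake of example), here is a quick sketch of why normalized counts matter: the set with more raw hits can easily be the set where the term is actually rarer.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to instances per million words."""
    return count / corpus_size * 1_000_000

# Hypothetical numbers: 500 hits in a 10-million-word corpus A,
# 1,000 hits in a 100-million-word corpus B.
rate_a = per_million(500, 10_000_000)    # 50 per million
rate_b = per_million(1000, 100_000_000)  # 10 per million

# B has twice the raw hits, but A uses the term five times as often.
print(rate_a, rate_b)
```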

This problem is ameliorated somewhat by looking not just at the raw numbers but at the error rates. That is, they searched for both the correct and incorrect forms of each item, calculated how frequent the erroneous form was, and compared the rates from 2006 to the rates from 2015. It would still be better to compare two similar datasets, because we have no idea how different the cleaned-up Google Ngrams data is from raw Google search data, but at least this allows us to make some rough comparisons. But notice the huge differences between the “then” and “now” numbers in the table below. Obviously the 2015 data represents a much larger set. (I’ve split their table into two pieces, one for the correct terms and one for the incorrect terms, to make them fit in my column here.)

| Correct Term | Then | Now |
|---|---:|---:|
| jugular vein | 56,409 | 794,000 |
| bear in mind | 931,235 | 35,500,000 |
| head over heels | 179,491 | 8,130,000 |
| chocolate mousse | 237,870 | 6,790,000 |
| egg yolk | 152,458 | 5,420,000 |
| without further ado | 120,124 | 1,960,000 |
| whet your appetite | 52,850 | 533,000 |
| heroin and morphine | 3,220 | 112,000 |
| reach across the aisle | 2,707 | 117,000 |
| herd mentality | 19,444 | 411,000 |
| weather vane | 70,906 | 477,000 |
| zombie horde | 21,091 | 464,000 |
| chili peppers | 1,105,405 | 29,100,000 |
| brake pedal | 138,765 | 1,450,000 |
| pique your interest | 8,126 | 296,000 |
| lessen the burden | 14,926 | 389,000 |
| bridal shower | 852,371 | 16,500,000 |

| Incorrect Term | Then | Now |
|---|---:|---:|
| juggler vein | 693 | 4,150 |
| bare in mind | 18,492 | 477,000 |
| head over heals | 12,633 | 398,000 |
| chocolate moose | 14,504 | 364,000 |
| egg yoke | 2,028 | 88,900 |
| without further adieu | 13,170 | 437,000 |
| wet your appetite | 8,930 | 216,000 |
| heroine and morphine | 45 | 3,860 |
| reach across the isle | 93 | 11,800 |
| heard mentality | 313 | 21,300 |
| weather vein | 698 | 16,100 |
| zombie hoard | 744 | 64,200 |
| chilly peppers | 2,532 | 155,000 |
| brake petal | 417 | 27,800 |
| peek your interest | 320 | 111,000 |
| lesson the burden | 212 | 91,400 |
| bridle shower | 182 | 157,000 |
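These error rates can be reproduced from the counts above: each rate is simply the incorrect count divided by the sum of the correct and incorrect counts. A minimal sketch, using the “juggler vein” row as an example:

```python
def error_rate(incorrect, correct):
    """Share of uses that take the erroneous form."""
    return incorrect / (incorrect + correct)

# "juggler vein" vs. "jugular vein", using the counts from the tables above
then = error_rate(693, 56_409)       # roughly 1.2%
now = error_rate(4_150, 794_000)     # roughly 0.5%

change = (now - then) / then         # roughly -57%
print(f"{then:.1%} {now:.1%} {change:+.1%}")
```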

But then the Correctica team commits a really major statistical goof—they average all those percentages together to calculate an overall percentage. Here’s their data again:

| Incorrect Term | Then | Now | Increase |
|---|---:|---:|---:|
| juggler vein | 1.2% | 0.5% | –57.2% |
| bare in mind | 1.9% | 1.3% | –31.9% |
| head over heals | 6.6% | 4.7% | –29.0% |
| chocolate moose | 5.7% | 5.1% | –11.5% |
| egg yoke | 1.3% | 1.6% | 22.9% |
| without further adieu | 9.9% | 18.2% | 84.5% |
| wet your appetite | 14.5% | 28.8% | 99.5% |
| heroine and morphine | 1.4% | 3.3% | 141.7% |
| reach across the isle | 3.3% | 9.2% | 175.8% |
| heard mentality | 1.6% | 4.9% | 211.0% |
| weather vein | 1.0% | 3.3% | 234.9% |
| zombie hoard | 3.4% | 12.2% | 256.7% |
| chilly peppers | 0.2% | 0.5% | 131.8% |
| brake petal | 0.3% | 1.9% | 527.9% |
| peek your interest | 3.8% | 27.3% | 619.8% |
| lesson the burden | 1.4% | 19.0% | 1258.6% |
| bridle shower | 0.0% | 0.9% | 4315.2% |
| Average | 3.4% | 8.4% | 148.2% |

They simply add up all the percentages (1.2% + 1.9% + 6.6% + . . .) and divide by the number of percentages, 17. But this number is meaningless. Imagine that we were comparing two items: isn’t is used 9,900 times and ain’t 100 times, and regardless is used 999 times and irregardless 1 time. This means that when there’s a choice between isn’t and ain’t, ain’t is used 1% of the time (100/(9,900 + 100)), and when there’s a choice between regardless and irregardless, irregardless is used .1% of the time (1/(999 + 1)). If you average 1% and .1%, you get .55%, but this isn’t the overall error rate.

To get an overall error rate, you need to calculate the percentage from the totals: take the total number of errors and divide it by the total number of opportunities to use either the correct or the incorrect form. This gives us (100 + 1)/((9,900 + 100) + (999 + 1)), or 101/11,000, which works out to .92%, not .55%.
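The difference between averaging the per-item rates and pooling the counts can be shown in a few lines of code, using the isn’t/ain’t example above:

```python
# (bad, good) count pairs: ain't vs. isn't, irregardless vs. regardless
pairs = [
    (100, 9_900),
    (1, 999),
]

# Averaging the per-item rates, as the Correctica post does:
rates = [bad / (bad + good) for bad, good in pairs]
average = sum(rates) / len(rates)                    # (1% + 0.1%) / 2 = 0.55%

# Pooling the counts, which gives the true overall rate:
total_bad = sum(bad for bad, _ in pairs)             # 101
total_all = sum(bad + good for bad, good in pairs)   # 11,000
pooled = total_bad / total_all                       # 101/11,000, about 0.92%

print(f"averaged: {average:.2%}, pooled: {pooled:.2%}")
```

The average of the rates weights the rare irregardless/regardless choice as heavily as the common ain’t/isn’t choice, which is why it understates the overall rate here.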

When we count up the totals and calculate the overall rates, we get an error rate of 1.88% for then (not 3.4%) and 2.38% for now (not 8.4%). That means the increase from 2006 to 2015 is not 148.2% but a much more modest 26.64%. (By the way, I’m not sure where they got 148.2%; by my calculations, it should be 147.1%, but I could have made a mistake somewhere.) This is still a rather impressive increase in errors from 2006 to today, but the problems with the data set make it impossible to say for sure if this number is accurate or meaningful. “Heroine and morphine” occurred 45 times out of over a trillion words. Even if the error rate jumped 141.7% from 2006 to 2015, and even if the two sample sets were comparable, this would still probably amount to nothing more than statistical noise.
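Anyone who wants to check those overall figures can total the counts from the tables above and compute pooled rates. A quick sketch (the numbers are copied straight from their data):

```python
# Counts from the "then" and "now" tables, in the same row order
correct_then = [56_409, 931_235, 179_491, 237_870, 152_458, 120_124,
                52_850, 3_220, 2_707, 19_444, 70_906, 21_091,
                1_105_405, 138_765, 8_126, 14_926, 852_371]
incorrect_then = [693, 18_492, 12_633, 14_504, 2_028, 13_170, 8_930,
                  45, 93, 313, 698, 744, 2_532, 417, 320, 212, 182]
correct_now = [794_000, 35_500_000, 8_130_000, 6_790_000, 5_420_000,
               1_960_000, 533_000, 112_000, 117_000, 411_000, 477_000,
               464_000, 29_100_000, 1_450_000, 296_000, 389_000,
               16_500_000]
incorrect_now = [4_150, 477_000, 398_000, 364_000, 88_900, 437_000,
                 216_000, 3_860, 11_800, 21_300, 16_100, 64_200,
                 155_000, 27_800, 111_000, 91_400, 157_000]

# Pooled error rate: total errors over total opportunities
rate_then = sum(incorrect_then) / (sum(correct_then) + sum(incorrect_then))
rate_now = sum(incorrect_now) / (sum(correct_now) + sum(incorrect_now))
increase = (rate_now - rate_then) / rate_then

print(f"then: {rate_then:.2%}, now: {rate_now:.2%}, increase: {increase:.2%}")
# then: 1.88%, now: 2.38%, increase: 26.64%
```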

And even if these numbers were accurate and meaningful, there’s still the question of research design. They claim that grammar errors have increased, but all of the items are spelling errors, and most of them are rather obscure ones at that. At best, this study only tells us that these errors have increased that much, not that grammar errors in general have increased that much. If you’re setting out to study grammar errors (using grammar in the broad sense), why would you assume that these items are representative of the phenomenon in general?

So in sum, the study is completely bogus, and it’s obviously nothing more than an attempt to sell yet another grammar-checking service. Is it important to check your writing for errors? Sure. Can Correctica help you do that? I have no idea. But I do know that this study doesn’t show an epidemic of grammar errors as it claims to.

(Here’s the data if anyone’s interested.)