No, Online Grammar Errors Have Not Increased by 148%

March 5, 2015

Yesterday a post went up on QuickandDirtyTips.com (home of Grammar Girl’s popular podcast) that appears to have been written by a company called Knowingly, which is promoting its Correctica grammar-checking tool. They claim that “online grammar errors have increased by 148% in nine years”. If true, that would be pretty shocking, but the numbers immediately sent up some red flags.

They searched for seventeen different errors and compared the numbers from nine years ago to the numbers from today. From the description, I gather that the first set of numbers comes from a publicly available set of data that Google culled from public web pages. The data was released in 2006 and is hosted by the Linguistic Data Consortium. You can read more about the data here, but this part is the most relevant:

We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

So the data is taken from over a trillion words of text, but some sequences were discarded if they didn’t appear frequently enough, and you can only search sequences up to five words long. Also note that while the data was released in 2006, it does not necessarily all come from 2006; some of it could have come from web pages that were older than that.

It sounds like the second set of numbers comes from a series of Google searches—it simply says “search result data today”. It isn’t explicitly stated, but it appears that the search terms were put in quotes to find exact strings. But we’re already comparing apples and oranges: though the first set of data came from a known sample size (just over a trillion words) and was cleaned up a bit by having outliers thrown out, we have no idea how big the second sample size is. How many words are you effectively searching when you do a search in Google?

This is why corpora usually present not just raw numbers but normalized numbers—that is, not just an overall count, but a count per thousand words or something similar. Knowing that you have 500 instances of something in data set A and 1000 instances in data set B doesn’t mean anything unless you know how big those sets are, and in this case we don’t.
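To make the point about normalization concrete, here is a minimal Python sketch. The corpus sizes are invented purely for illustration and do not come from either dataset; the helper name `per_million` is likewise just a label chosen for this example.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Return occurrences per million words of running text."""
    return count * 1_000_000 / corpus_size


# Hypothetical corpus sizes, chosen only to illustrate the idea.
size_a = 1_000_000   # data set A: one million words
size_b = 4_000_000   # data set B: four million words

rate_a = per_million(500, size_a)    # 500 raw hits in A
rate_b = per_million(1000, size_b)   # 1000 raw hits in B

# B has twice the raw hits, but once corpus size is taken into
# account the item is actually only half as frequent there.
print(rate_a, rate_b)
```

Without the two sizes, the raw counts of 500 and 1000 cannot be compared at all, which is exactly the situation with the Google search numbers.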

This problem is ameliorated somewhat by looking not just at the raw numbers but at the error rates. That is, they searched for both the correct and incorrect forms of each item, calculated how frequent the erroneous form was, and compared the rates from 2006 to the rates from 2015. It would still be better to compare two similar datasets, because we have no idea how different the cleaned-up Google Ngrams data is from raw Google search data, but at least this allows us to make some rough comparisons. But notice the huge differences between the “then” and “now” numbers in the table below. Obviously the 2015 data represents a much larger set. (I’ve split their table into two pieces, one for the correct terms and one for the incorrect terms, to make them fit in my column here.)

Correct Term	Then	Now
jugular vein	56,409	794,000
bear in mind	931,235	35,500,000
head over heels	179,491	8,130,000
chocolate mousse	237,870	6,790,000
egg yolk	152,458	5,420,000
without further ado	120,124	1,960,000
whet your appetite	52,850	533,000
heroin and morphine	3,220	112,000
reach across the aisle	2,707	117,000
herd mentality	19,444	411,000
weather vane	70,906	477,000
zombie horde	21,091	464,000
chili peppers	1,105,405	29,100,000
brake pedal	138,765	1,450,000
pique your interest	8,126	296,000
lessen the burden	14,926	389,000
bridal shower	852,371	16,500,000

Incorrect Term	Then	Now
juggler vein	693	4,150
bare in mind	18,492	477,000
head over heals	12,633	398,000
chocolate moose	14,504	364,000
egg yoke	2,028	88,900
without further adieu	13,170	437,000
wet your appetite	8,930	216,000
heroine and morphine	45	3,860
reach across the isle	93	11,800
heard mentality	313	21,300
weather vein	698	16,100
zombie hoard	744	64,200
chilly peppers	2,532	155,000
brake petal	417	27,800
peek your interest	320	111,000
lesson the burden	212	91,400
bridle shower	182	157,000

But then the Correctica team commits a really major statistical goof—they average all those percentages together to calculate an overall percentage. Here’s their data again:

Incorrect Term	Then	Now	Increase
juggler vein	1.2%	0.5%	–57.2%
bare in mind	1.9%	1.3%	–31.9%
head over heals	6.6%	4.7%	–29.0%
chocolate moose	5.7%	5.1%	–11.5%
egg yoke	1.3%	1.6%	22.9%
without further adieu	9.9%	18.2%	84.5%
wet your appetite	14.5%	28.8%	99.5%
heroine and morphine	1.4%	3.3%	141.7%
reach across the isle	3.3%	9.2%	175.8%
heard mentality	1.6%	4.9%	211.0%
weather vein	1.0%	3.3%	234.9%
zombie hoard	3.4%	12.2%	256.7%
chilly peppers	0.2%	0.5%	131.8%
brake petal	0.3%	1.9%	527.9%
peek your interest	3.8%	27.3%	619.8%
lesson the burden	1.4%	19.0%	1258.6%
bridle shower	0.0%	0.9%	4315.2%
	3.4%	8.4%	148.2%

They simply add up all the percentages (1.2% + 1.9% + 6.6% + . . .) and divide by the number of percentages, 17. But this number is meaningless. Imagine that we were comparing two items: isn’t is used 9,900 times and ain’t 100 times, and regardless is used 999 times and irregardless 1 time. This means that when there’s a choice between isn’t and ain’t, ain’t is used 1% of the time (100/(9900+100)), and when there’s a choice between regardless and irregardless, irregardless is used .1% of the time (1/(999+1)). If you average 1% and .1%, you get .55%, but this isn’t the overall error rate.

To get an overall error rate, you need to calculate the percentage from the totals: take the total number of errors and divide it by the total number of opportunities to use either the correct or the incorrect form. This gives us (100+1)/((9900+999)+(100+1)), or 101/11000, which works out to .92%, not .55%.
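For anyone who wants to check that arithmetic, here is a short Python sketch of the two-pair example. The variable names are just labels for this illustration; the counts are the made-up ones from the example.

```python
# (correct count, incorrect count) for each pair in the toy example
pairs = {
    "isn't / ain't": (9900, 100),
    "regardless / irregardless": (999, 1),
}

# Per-item error rates: 1% and 0.1%
rates = [wrong / (right + wrong) for right, wrong in pairs.values()]

# Averaging the percentages treats both items as equally weighted,
# even though one has ten times as many occurrences as the other.
mean_of_rates = sum(rates) / len(rates)

# The overall rate pools the counts first, then divides.
total_wrong = sum(wrong for _, wrong in pairs.values())
total_uses = sum(right + wrong for right, wrong in pairs.values())
pooled_rate = total_wrong / total_uses

print(f"mean of rates: {mean_of_rates:.2%}")  # 0.55%
print(f"pooled rate:   {pooled_rate:.2%}")    # 0.92%
```

The two numbers agree only when every item has the same number of total occurrences, which is clearly not the case in the Correctica tables.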

When we count up the totals and calculate the overall rates, we get an error rate of 1.88% for then (not 3.4%) and 2.38% for now (not 8.4%). That means the increase from 2006 to 2015 is not 148.2%, but a much more modest 26.64%. (By the way, I’m not sure where they got 148.2%; by my calculations, it should be 147.1%, but I could have made a mistake somewhere.) This is still a rather impressive increase in errors from 2006 to today, but the problems with the data set make it impossible to say for sure if this number is accurate or meaningful. “Heroine and morphine” occurred 45 times out of over a trillion words. Even if the error rate jumped 141.73% from 2006 to 2015, and even if the two sample sets were comparable, this would still probably amount to nothing more than statistical noise.
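The pooled calculation can be reproduced from the two tables above. The following Python sketch is one way to do it, not necessarily how the original spreadsheet was set up; the row labels and the function name `pooled_error_rate` are chosen here for illustration, and the counts are copied from the tables.

```python
# Each row: (label, then_correct, then_wrong, now_correct, now_wrong)
ROWS = [
    ("jugular/juggler vein", 56_409, 693, 794_000, 4_150),
    ("bear/bare in mind", 931_235, 18_492, 35_500_000, 477_000),
    ("head over heels/heals", 179_491, 12_633, 8_130_000, 398_000),
    ("chocolate mousse/moose", 237_870, 14_504, 6_790_000, 364_000),
    ("egg yolk/yoke", 152_458, 2_028, 5_420_000, 88_900),
    ("without further ado/adieu", 120_124, 13_170, 1_960_000, 437_000),
    ("whet/wet your appetite", 52_850, 8_930, 533_000, 216_000),
    ("heroin/heroine and morphine", 3_220, 45, 112_000, 3_860),
    ("reach across the aisle/isle", 2_707, 93, 117_000, 11_800),
    ("herd/heard mentality", 19_444, 313, 411_000, 21_300),
    ("weather vane/vein", 70_906, 698, 477_000, 16_100),
    ("zombie horde/hoard", 21_091, 744, 464_000, 64_200),
    ("chili/chilly peppers", 1_105_405, 2_532, 29_100_000, 155_000),
    ("brake pedal/petal", 138_765, 417, 1_450_000, 27_800),
    ("pique/peek your interest", 8_126, 320, 296_000, 111_000),
    ("lessen/lesson the burden", 14_926, 212, 389_000, 91_400),
    ("bridal/bridle shower", 852_371, 182, 16_500_000, 157_000),
]


def pooled_error_rate(pairs):
    """Total errors divided by total uses (correct + incorrect)."""
    pairs = list(pairs)
    wrong = sum(w for _, w in pairs)
    total = sum(c + w for c, w in pairs)
    return wrong / total


then_rate = pooled_error_rate((r[1], r[2]) for r in ROWS)
now_rate = pooled_error_rate((r[3], r[4]) for r in ROWS)
increase = now_rate / then_rate - 1  # relative change in the pooled rate

print(f"then: {then_rate:.2%}")
print(f"now:  {now_rate:.2%}")
print(f"relative increase: {increase:.2%}")
```

Run as written, this should land close to the 1.88%, 2.38%, and roughly 26.6% figures quoted above. It says nothing about whether the two datasets are comparable in the first place.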

And even if these numbers were accurate and meaningful, there’s still the question of research design. They claim that grammar errors have increased, but all of the items are spelling errors, and most of them are rather obscure ones at that. At best, this study only tells us that these errors have increased that much, not that grammar errors in general have increased that much. If you’re setting out to study grammar errors (using grammar in the broad sense), why would you assume that these items are representative of the phenomenon in general?

So in sum, the study is completely bogus, and it’s obviously nothing more than an attempt to sell yet another grammar-checking service. Is it important to check your writing for errors? Sure. Can Correctica help you do that? I have no idea. But I do know that this study doesn’t show an epidemic of grammar errors as it claims to.

(Here’s the data if anyone’s interested.)

COMMENTS

6 thoughts on “No, Online Grammar Errors Have Not Increased by 148%”

Your article gave me a headache!

There are several possible explanations that spring to mind. First, the increased use of spell checkers and the resulting errors caused by the spell checker choosing the wrong word, and second, the fact that a lot of publishers have fired their copy editors. Also, some things, like the personal blogs that seem to have replaced columnists, which are more and more popular, simply don’t get copy edited.

I think it’s safe to say that the majority of content online is not copy edited. The question is whether content now is less edited or more poorly edited than it was nine years ago, but I don’t think this study can tell us that. The data here is so problematic that it’s difficult if not impossible to draw any meaningful conclusions from it, and thus there’s no need to try to explain it.

Why would “Sue is a true heroine and morphine is not something she’d mess with” be ungrammatical (or rather, as I suspect they mean, a spelling error)?

I think this is just another example of bad research design. They obviously wanted to search for the misspelling of heroin as heroine, but that would result in a lot of false positives for the word heroine.

So they searched instead for the collocation heroine and morphine, which is not a terribly common collocation and thus probably not a good indicator of just how common the misspelling heroine really is.

Plus, there’s the possibility that it’s still pulling in false positives, as in your hypothetical example, though I’d honestly be surprised if there are any. That seems like an even rarer collocation than “heroine and morphine”.

Neal Whitman pretty much said what I was going to say. These errors are not grammar errors. I don’t even think they’re spelling errors. Is there an official word for exchanging-homophones errors?
In my possibly useless opinion, some outfit that claims to have a grammar-correcting tool and doesn’t know what grammar is is much more of a threat than some fan of Bullwinkle J. Mousse.

(Uh oh, I wrote “is is.” Maybe this will end up in someone’s tally of on-line grammos.)

Arrant Pedantry