Arrant Pedantry

No, Online Grammar Errors Have Not Increased by 148%

Yesterday a post appeared on QuickandDirtyTips.com (home of Grammar Girl’s popular podcast) that appears to have been written by a company called Knowingly, which is promoting its Correctica grammar-checking tool. They claim that “online grammar errors have increased by 148% in nine years”. If true, it would be a pretty shocking claim, but the numbers immediately sent up some red flags.

They searched for seventeen different errors and compared the numbers from nine years ago to the numbers from today. From the description, I gather that the first set of numbers comes from a publicly available set of data that Google culled from public web pages. The data was released in 2006 and is hosted by the Linguistic Data Consortium. You can read more about the data here, but this part is the most relevant:

We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

So the data is taken from over a trillion words of text, but some sequences were discarded if they didn’t appear frequently enough, and you can only search sequences up to five words long. Also note that while the data was released in 2006, it does not necessarily all come from 2006; some of it could have come from web pages that were older than that.

It sounds like the second set of numbers comes from a series of Google searches; it simply says “search result data today”. It isn’t explicitly stated, but it appears that the search terms were put in quotes to find exact strings. But we’re already comparing apples and oranges: though the first set of data came from a known sample size (just over a trillion words) and was cleaned up a bit by having low-frequency sequences thrown out, we have no idea how big the second sample size is. How many words are you effectively searching when you do a search in Google?

This is why corpora usually present not just raw numbers but normalized numbers—that is, not just an overall count, but a count per thousand words or something similar. Knowing that you have 500 instances of something in data set A and 1000 instances in data set B doesn’t mean anything unless you know how big those sets are, and in this case we don’t.
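To make the point about normalization concrete, here is a minimal Python sketch. The corpus sizes are hypothetical numbers chosen purely for illustration; neither comes from the data sets discussed here, and the function name is likewise made up.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Return the number of occurrences per million words of text."""
    return count * 1_000_000 / corpus_size

# Hypothetical corpus sizes, for illustration only.
size_a = 10_000_000   # data set A: 10 million words
size_b = 50_000_000   # data set B: 50 million words

rate_a = per_million(500, size_a)    # 500 raw instances in A
rate_b = per_million(1000, size_b)   # 1000 raw instances in B

print(f"A: {rate_a:.1f} per million words")
print(f"B: {rate_b:.1f} per million words")
```

With these made-up sizes, B has twice the raw count of A but less than half its normalized frequency, which is why raw counts alone can't be compared.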

This problem is ameliorated somewhat by looking not just at the raw numbers but at the error rates. That is, they searched for both the correct and incorrect forms of each item, calculated how frequent the erroneous form was, and compared the rates from 2006 to the rates from 2015. It would still be better to compare two similar datasets, because we have no idea how different the cleaned-up Google Ngrams data is from raw Google search data, but at least this allows us to make some rough comparisons. But notice the huge differences between the “then” and “now” numbers in the table below. Obviously the 2015 data represents a much larger set. (I’ve split their table into two pieces, one for the correct terms and one for the incorrect terms, to make them fit in my column here.)

| Correct Term | Then | Now |
|---|---|---|
| jugular vein | 56,409 | 794,000 |
| bear in mind | 931,235 | 35,500,000 |
| head over heels | 179,491 | 8,130,000 |
| chocolate mousse | 237,870 | 6,790,000 |
| egg yolk | 152,458 | 5,420,000 |
| without further ado | 120,124 | 1,960,000 |
| whet your appetite | 52,850 | 533,000 |
| heroin and morphine | 3,220 | 112,000 |
| reach across the aisle | 2,707 | 117,000 |
| herd mentality | 19,444 | 411,000 |
| weather vane | 70,906 | 477,000 |
| zombie horde | 21,091 | 464,000 |
| chili peppers | 1,105,405 | 29,100,000 |
| brake pedal | 138,765 | 1,450,000 |
| pique your interest | 8,126 | 296,000 |
| lessen the burden | 14,926 | 389,000 |
| bridal shower | 852,371 | 16,500,000 |

| Incorrect Term | Then | Now |
|---|---|---|
| juggler vein | 693 | 4,150 |
| bare in mind | 18,492 | 477,000 |
| head over heals | 12,633 | 398,000 |
| chocolate moose | 14,504 | 364,000 |
| egg yoke | 2,028 | 88,900 |
| without further adieu | 13,170 | 437,000 |
| wet your appetite | 8,930 | 216,000 |
| heroine and morphine | 45 | 3,860 |
| reach across the isle | 93 | 11,800 |
| heard mentality | 313 | 21,300 |
| weather vein | 698 | 16,100 |
| zombie hoard | 744 | 64,200 |
| chilly peppers | 2,532 | 155,000 |
| brake petal | 417 | 27,800 |
| peek your interest | 320 | 111,000 |
| lesson the burden | 212 | 91,400 |
| bridle shower | 182 | 157,000 |
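The per-item error rate described above (incorrect uses divided by all uses of either form) can be sketched in Python like this, using the jugular vein and juggler vein counts from the tables. The function names are hypothetical and only illustrate the arithmetic.

```python
def error_rate(correct: int, incorrect: int) -> float:
    """Share of all uses (correct + incorrect) that are the incorrect form."""
    return incorrect / (correct + incorrect)

def relative_change(old: float, new: float) -> float:
    """Change from old to new, expressed as a fraction of old."""
    return (new - old) / old

# Counts for "jugular vein" (correct) and "juggler vein" (incorrect)
then = error_rate(correct=56_409, incorrect=693)
now = error_rate(correct=794_000, incorrect=4_150)

print(f"then:   {then:.1%}")
print(f"now:    {now:.1%}")
print(f"change: {relative_change(then, now):+.1%}")
```

Because each rate is a ratio within its own data set, it is at least roughly comparable across samples of very different sizes, unlike the raw counts.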

But then the Correctica team commits a really major statistical goof—they average all those percentages together to calculate an overall percentage. Here’s their data again:

| Incorrect Term | Then | Now | Increase |
|---|---|---|---|
| juggler vein | 1.2% | 0.5% | –57.2% |
| bare in mind | 1.9% | 1.3% | –31.9% |
| head over heals | 6.6% | 4.7% | –29.0% |
| chocolate moose | 5.7% | 5.1% | –11.5% |
| egg yoke | 1.3% | 1.6% | 22.9% |
| without further adieu | 9.9% | 18.2% | 84.5% |
| wet your appetite | 14.5% | 28.8% | 99.5% |
| heroine and morphine | 1.4% | 3.3% | 141.7% |
| reach across the isle | 3.3% | 9.2% | 175.8% |
| heard mentality | 1.6% | 4.9% | 211.0% |
| weather vein | 1.0% | 3.3% | 234.9% |
| zombie hoard | 3.4% | 12.2% | 256.7% |
| chilly peppers | 0.2% | 0.5% | 131.8% |
| brake petal | 0.3% | 1.9% | 527.9% |
| peek your interest | 3.8% | 27.3% | 619.8% |
| lesson the burden | 1.4% | 19.0% | 1258.6% |
| bridle shower | 0.0% | 0.9% | 4315.2% |
| Average | 3.4% | 8.4% | 148.2% |

They simply add up all the percentages (1.2% + 1.9% + 6.6% + . . .) and divide by the number of percentages, 17. But this number is meaningless. Imagine that we were comparing two items: isn’t is used 9,900 times and ain’t 100 times, and regardless is used 999 times and irregardless 1 time. This means that when there’s a choice between isn’t and ain’t, ain’t is used 1% of the time (100/(9900+100)), and when there’s a choice between regardless and irregardless, irregardless is used .1% of the time (1/(999+1)). If you average 1% and .1%, you get .55%, but this isn’t the overall error rate.

To get an overall error rate, you need to calculate the percentage from the totals. We have to take the total number of errors and the total number of opportunities to use either the correct or the incorrect form. This gives us (1+100)/((9900+999)+(100+1)), or 101/11000, which works out to .92%, not .55%.
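The difference between averaging the rates and pooling the totals can be checked with a few lines of Python. This is only a sketch of the toy example above; the variable names are made up for illustration.

```python
# (correct_count, incorrect_count) for each pair in the toy example
pairs = {
    "isn't / ain't": (9_900, 100),
    "regardless / irregardless": (999, 1),
}

# Per-item error rates, then their unweighted mean
rates = [bad / (good + bad) for good, bad in pairs.values()]
mean_of_rates = sum(rates) / len(rates)

# Pooled rate: total errors divided by total opportunities
total_bad = sum(bad for _, bad in pairs.values())
total_all = sum(good + bad for good, bad in pairs.values())
pooled_rate = total_bad / total_all

print(f"mean of rates: {mean_of_rates:.2%}")
print(f"pooled rate:   {pooled_rate:.2%}")
```

The two numbers differ because the unweighted mean gives the pair with 1,000 total uses the same weight as the pair with 10,000.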

When we count up the totals and calculate the overall rates, we get an error rate of 1.88% for then (not 3.4%) and 2.38% for now (not 8.4%). That means the increase from 2006 to 2015 is not 148.2%, but a much more modest 26.64%. (By the way, I’m not sure where they got 148.2%; by my calculations, it should be 147.1%, but I could have made a mistake somewhere.) This is still a rather impressive increase in errors from 2006 to today, but the problems with the data set make it impossible to say for sure if this number is accurate or meaningful. “Heroine and morphine” occurred 45 times out of over a trillion words. Even if the error rate jumped 141.73% from 2006 to 2015, and even if the two sample sets were comparable, this would still probably amount to nothing more than statistical noise.
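The recalculation from totals can be reproduced with a short Python sketch, assuming the raw counts in the two tables above are transcribed correctly. The function and variable names are hypothetical, and the unweighted average is included only for contrast with the pooled rate.

```python
# Raw counts transcribed from the two tables above:
# (correct_then, incorrect_then, correct_now, incorrect_now)
counts = [
    (56_409, 693, 794_000, 4_150),            # jugular / juggler vein
    (931_235, 18_492, 35_500_000, 477_000),   # bear / bare in mind
    (179_491, 12_633, 8_130_000, 398_000),    # head over heels / heals
    (237_870, 14_504, 6_790_000, 364_000),    # chocolate mousse / moose
    (152_458, 2_028, 5_420_000, 88_900),      # egg yolk / yoke
    (120_124, 13_170, 1_960_000, 437_000),    # without further ado / adieu
    (52_850, 8_930, 533_000, 216_000),        # whet / wet your appetite
    (3_220, 45, 112_000, 3_860),              # heroin / heroine and morphine
    (2_707, 93, 117_000, 11_800),             # reach across the aisle / isle
    (19_444, 313, 411_000, 21_300),           # herd / heard mentality
    (70_906, 698, 477_000, 16_100),           # weather vane / vein
    (21_091, 744, 464_000, 64_200),           # zombie horde / hoard
    (1_105_405, 2_532, 29_100_000, 155_000),  # chili / chilly peppers
    (138_765, 417, 1_450_000, 27_800),        # brake pedal / petal
    (8_126, 320, 296_000, 111_000),           # pique / peek your interest
    (14_926, 212, 389_000, 91_400),           # lessen / lesson the burden
    (852_371, 182, 16_500_000, 157_000),      # bridal / bridle shower
]

def pooled_rate(pairs):
    """Total incorrect uses divided by total uses (correct + incorrect)."""
    incorrect = sum(bad for _, bad in pairs)
    total = sum(good + bad for good, bad in pairs)
    return incorrect / total

def mean_of_rates(pairs):
    """Unweighted average of the per-item error rates."""
    rates = [bad / (good + bad) for good, bad in pairs]
    return sum(rates) / len(rates)

then_pairs = [(c, i) for c, i, _, _ in counts]
now_pairs = [(c, i) for _, _, c, i in counts]

pooled_then = pooled_rate(then_pairs)
pooled_now = pooled_rate(now_pairs)
pooled_increase = (pooled_now - pooled_then) / pooled_then

avg_then = mean_of_rates(then_pairs)
avg_now = mean_of_rates(now_pairs)

print(f"pooled:   then {pooled_then:.2%}, now {pooled_now:.2%}, "
      f"increase {pooled_increase:.2%}")
print(f"averaged: then {avg_then:.1%}, now {avg_now:.1%}")
```

Run as written, the pooled rates should come out close to the 1.88% and 2.38% figures, while the unweighted averages should land near the 3.4% and 8.4% in the Correctica table.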

And even if these numbers were accurate and meaningful, there’s still the question of research design. They claim that grammar errors have increased, but all of the items are spelling errors, and most of them are rather obscure ones at that. At best, this study only tells us that these errors have increased that much, not that grammar errors in general have increased that much. If you’re setting out to study grammar errors (using grammar in the broad sense), why would you assume that these items are representative of the phenomenon in general?

So in sum, the study is completely bogus, and it’s obviously nothing more than an attempt to sell yet another grammar-checking service. Is it important to check your writing for errors? Sure. Can Correctica help you do that? I have no idea. But I do know that this study doesn’t show an epidemic of grammar errors as it claims to.

(Here’s the data if anyone’s interested.)

Why Descriptivists Are Usage Liberals

Outside of linguistics, the people who care most about language tend to be prescriptivists—editors, writers, English teachers, and so on—while linguists and lexicographers are descriptivists. “Descriptive, not prescriptive!” is practically the linguist rallying cry. But we linguists have done a terrible job of explaining just what that means and why it matters. As I tried to explain in “What Descriptivism Is and Isn’t”, descriptivism is essentially just an interest in facts. That is, we make observations about what the language is rather than state opinions about how we’d like it to be.

Descriptivism is often cast as the opposite of prescriptivism, but they aren’t opposites at all. But no matter how many times we insist that “descriptivism isn’t ‘anything goes’”, people continue to believe that we’re all grammatical anarchists and linguistic relativists, declaring everything correct and saying that there’s no such thing as a grammatical error.

Part of the problem is that whenever you conceive of two approaches as opposing points of view, people will assume that they’re opposite in every regard. Prescriptivists generally believe that communication is important, that having a standard form of the language facilitates communication, and that we need to uphold the rules to maintain the standard. And what people often see is that linguists continually tear down the rules and say that they don’t really matter. The natural conclusion for many people is that linguists don’t care about maintaining the standard or supporting good communication—they want a linguistic free-for-all instead. Then descriptivists appear to be hypocrites for using the very standard they allegedly despise.

It’s true that many descriptivists oppose rules that they disagree with, but as I’ve said before, this isn’t really descriptivism—it’s anti-prescriptivism, for lack of a better term. (Not because it’s the opposite of prescriptivism, but because it often prescribes the opposite of what traditional linguistic prescriptivism does.) Just ask yourself how an anti-prescriptive sentiment like “There’s nothing wrong with singular they” is a description of linguistic fact.

So if that’s not descriptivism, then why do so many linguists have such liberal views on usage? What does being against traditional rules have to do with studying language? And how can linguists oppose rules and still be in favor of good communication and Standard English?

The answer, in a nutshell, is that we don’t think that the traditional rules have much to do with either good communication or Standard English. The reason why we think that is a little more complicated.

Linguists have had a hard time defining just what Standard English is, but there are several ideas that recur in attempts to define it. First, although Standard English can certainly be spoken, it is often conceived of as a written variety, especially in the minds of non-linguists. Second, it is generally more formal, making it appropriate for a wide range of serious topics. Third, it is educated, or rather, it is used by educated speakers. Fourth, it is supraregional, meaning that it is not tied to a specific region, as most dialects are, but that it can be used across an entire language area. And fifth, it is careful or edited. Notions of uniformity and prestige are often thrown into the mix as well.

Careful is a vague term, but it means that users of Standard English put some care into what they say or write. This is especially true of most published writing; the entire profession of editing is dedicated to putting care into the written word. So it’s tempting to say that following the rules is an important part of Standard English and that tearing down those rules tears down at least that part of Standard English.

But the more important point is that Standard English is ultimately rooted in the usage of actual speakers and writers. It’s not just that there is no legislative body declaring what’s standard, but that there are no first principles from which we can deduce what’s standard. All languages are different, and they change over time, so how can we know what’s right or wrong except by looking at the evidence? This is what descriptivists try to do when discussing usage: look at the evidence from historical and current usage and draw meaningful conclusions about what’s right or wrong. (There are some logical problems with this, but I’ll address those another time.)

Let’s take singular they, for example. The evidence shows that it’s been in use for centuries not just by common folk or educated speakers but by well-respected writers from Geoffrey Chaucer to Jane Austen. The evidence also shows that it’s used in fairly predictable ways, generally to refer to indefinite pronouns or to nouns that don’t specify gender. Its use has not caused the grammar of English to collapse, and it seems like a rather felicitous solution to the gender-neutral pronoun problem. So at least from a dispassionate linguistic point of view, there is no problem with it.

From another point of view, though, there is something wrong with it: some people don’t like it. This is a social rather than a linguistic fact, but it’s a fact nonetheless. But this social fact arose because at some point someone declared—contrary to the linguistic facts—that singular they is a grammatical error that should be avoided. Here’s where descriptivists depart from description and get into anti-prescription. If people have been taught to dislike this usage, it stands to reason that they could be taught to get over this dislike.

That is, linguists are engaging in anti-prescriptivism to counter the prescriptivism that isn’t rooted in linguistic fact. So when they debunk or tear down traditional rules, it’s not that they don’t value Standard English or good communication; it’s that they think that those particular rules have nothing to do with either.

To be fair, I think that many linguists think they’re still merely describing when they’re countering prescriptive attitudes. Saying that singular they has been used for centuries by respected writers, that it appears to follow fairly well-defined rules, and that the proscription against it is not based in linguistic fact is descriptive; saying that people need to get over their dislike and accept it is not.

And this is precisely why I think descriptivism and prescriptivism not only can but should coexist. It’s not wrong to have opinions on what’s right or wrong, but I think it’s better if those opinions have some basis in fact. Guidance on issues of usage can really only be relevant and valid if it takes all the evidence into account: who uses a certain word or construction, in what circumstances, and so on. These are all facts that can be investigated, and linguistics provides a solid methodological framework for doing so. Anything that ignores the facts reduces to one sort of ipse dixit or another, either a statement from an authority declaring something to be right or wrong or one’s own preferences or pet peeves.

Linguists value good communication, and we recognize the importance of Standard English. But our opinions on both are informed by our study of language and by our emphasis on facts and evidence. This isn’t “anything goes”, or at least no more so than language has always been. People have always worried about language change, but language has always turned out fine. Inventing new rules to try to regulate language will not save it from destruction, and tossing out the rules that have no basis in fact will not hasten the language’s demise. But recognizing that some rules don’t matter may alleviate some of those worries, and I think that’s a good thing for both camps.
