Arrant Pedantry

By

My Thesis

I’ve been putting this post off for a while for a couple of reasons: first, I was a little burned out and was enjoying not thinking about my thesis for a while, and second, I wasn’t sure how to tackle this post. My thesis is about eighty pages long all told, and I wasn’t sure how to reduce it to a manageable length. But enough procrastinating.

The basic idea of my thesis was to see which usage changes editors are enforcing in print and thus infer what kind of role they’re playing in standardizing (specifically codifying) usage in Standard Written English. Standard English is apparently pretty difficult to define precisely, but most discussions of it say that it’s the language of educated speakers and writers, that it’s more formal, and that it achieves greater uniformity by limiting or regulating the variation found in regional dialects. Very few writers, however, consider the role that copy editors play in defining and enforcing Standard English, and what I could find was mostly speculative or anecdotal. That’s the gap my research aimed to fill, and my hunch was that editors were not merely policing errors but were actively introducing changes to Standard English that set it apart from other forms of the language.

Some of you may remember that I solicited help with my research a couple of years ago. I had collected about two dozen manuscripts edited by student interns and then reviewed by professionals, and I wanted to increase and improve my sample size. Between the intern and volunteer edits, I had about 220,000 words of copy-edited text. Tabulating the grammar and usage changes took a very long time, and the results weren’t as impressive as I’d hoped they’d be. There were still some clear patterns, though, and I believe they confirmed my basic idea.

The most popular usage changes were standardizing the genitive form of names ending in -s (Jones’ to Jones’s), changing which to that, changing towards to toward, moving only, and increasing parallelism. These changes were not only numerically the most popular, but they were also made at fairly high rates, up to 80 percent. That is, if towards appeared ten times, it was changed to toward eight times. The interesting thing about most of these is that they’re relatively recent inventions of usage writers. I’ve already written about which hunting on this blog, and I recently wrote about towards for Visual Thesaurus.
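As a rough illustration of how such an edit rate is calculated, here is a minimal Python sketch. The tallies are made up for illustration (only the towards figures mirror the ten-and-eight example above); they are not the thesis data, and the function name edit_rate is hypothetical.

```python
def edit_rate(times_changed, times_seen):
    """Share of a form's occurrences that editors changed."""
    if times_seen == 0:
        raise ValueError("form never occurred, so no rate can be computed")
    return times_changed / times_seen

# Hypothetical tallies: change -> (times changed, times the form appeared).
tallies = {
    "towards > toward": (8, 10),
    "which > that": (3, 4),
}

rates = {
    change: edit_rate(changed, seen)
    for change, (changed, seen) in tallies.items()
}

# Print each rate as a whole-number percentage.
for change, rate in rates.items():
    print(f"{change}: {rate:.0%}")
```

The rate is computed per form rather than across all changes, because a rare form that is changed almost every time tells you more about editorial practice than a common form that is changed only occasionally.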

In both cases, the rule was invented not to halt language change, but to reduce variation. For example, in unedited writing, English speakers use towards and toward with roughly equal frequency; in edited writing, toward outnumbers towards 10 to 1. With editors enforcing the rule in writing, the rule quickly becomes circular—you should use toward because it’s the norm in Standard (American) English. Garner used a similarly circular defense of the that/which rule in this New York Times Room for Debate piece with Robert Lane Greene:

But my basic point stands: In American English from circa 1930 on, “that” has been overwhelmingly restrictive and “which” overwhelmingly nonrestrictive. Strunk, White and other guidebook writers have good reasons for their recommendation to keep them distinct — and the actual practice of edited American English bears this out.

He’s certainly correct in saying that since 1930 or so, editors have been changing restrictive which to that. But this isn’t evidence that there’s a good reason for the recommendation; it’s only evidence that editors believe there’s a good reason.

What is interesting is that usage writers frequently invoke Standard English in defense of the rules, saying that you should change towards to toward or which to that because the proscribed forms aren’t acceptable in Standard English. But if Standard English is the formal, nonregional language of educated speakers and writers, then how can we say that towards or restrictive which are nonstandard? What I realized is this: part of the problem with defining Standard English is that we’re talking about two similar but distinct things—the usage of educated speakers, and the edited usage of those speakers. But because of the very nature of copy editing, we conflate the two. Editing is supposed to be invisible, so we don’t know whether what we’re seeing is the author’s or the editor’s.

Arguments about proper usage become confused because the two sides are talking past each other using the same term. Usage writers, editors, and others see linguists as the enemies of Standard (Edited) English because they see them tearing down the rules, like that/which and toward/towards, that define it and set it apart from educated but unedited usage. Linguists, on the other hand, see these invented rules as being unnecessarily imposed on people who already use Standard English, and they question the motives of those who create and enforce the rules. In essence, Standard English arises from the usage of educated speakers and writers, while Standard Edited English adds many more regulative rules from the prescriptive tradition.

My findings have some serious implications for the use of corpora to study usage. Corpus linguistics has done much to clarify questions of what’s standard, but the results can still be misleading. With corpora, we can separate many usage myths and superstitions from actual edited usage, but we can’t separate edited usage from simple educated usage. We look at corpora of edited writing and think that we’re researching Standard English, but we’re unwittingly researching Standard Edited English.

None of this is to say that all editing is pointless, or that all usage rules are unnecessary inventions, or that there’s no such thing as error because educated speakers don’t make mistakes. But I think it’s important to differentiate between true mistakes and forms that have simply been proscribed by grammarians and editors. I don’t believe that towards and restrictive which can rightly be called errors, and I think it’s even a stretch to call them stylistically bad. I’m open to the possibility that it’s okay or even desirable to engineer some language changes, but I’m unconvinced that either of the rules proscribing these is necessary, especially when the arguments for them are so circular. At the very least, rules like this serve to signal to readers that they are reading Standard Edited English. They are a mark of attention to detail, even if the details in question are irrelevant. The fact that someone paid attention to them is perhaps what is most important.

And now, if you haven’t had enough, you can go ahead and read the whole thesis here.

The Data Is In, pt. 2

In the last post, I said that the debate over whether data is singular or plural is ultimately a question of how we know whether a word is singular or plural, or, more accurately, whether it is count or mass. To determine whether data is a count or a mass noun, we’ll need to answer a few questions. First—and this one may seem so obvious as to not need stating—does it have both singular and plural forms? Second, does it occur with cardinal numbers? Third, what kinds of grammatical agreement does it trigger?

Most attempts to settle the debate point to the etymology of the word, but this is an unreliable guide. Some words begin life as plurals but become reanalyzed as singulars or vice versa. For example, truce, bodice, and to some extent dice and pence were originally plural forms that have been made into singulars. As some of the posts I linked to last time pointed out, agenda was also a Latin plural, much like data, but it’s almost universally treated as a singular now, along with insignia, opera, and many others. On the flip side, cherries and peas were originally singular forms that were reanalyzed as plurals, giving rise to the new singular forms cherry and pea.

So obviously etymology alone cannot tell us what a word should mean or how it should work today, but then again, any attempt to say what a word ought to mean ultimately rests on one logical fallacy or another, because you can’t logically derive an ought from an is. Nevertheless, if you want to determine how a word really works, you need to look at real usage. Present usage matters most, but historical usage can also shed light on such problems.

Unfortunately for the “data is plural” crowd, both present and historical usage are far more complicated than most people realize. The earliest citation in the OED for either data or datum is from 1630, but it’s just a one-word quote, “Data.” The next citation is from 1645 for the plural count noun “datas” (!), followed by the more familiar “data” in 1646. The singular mass noun appeared in 1702, and the singular count noun “datum” didn’t appear until 1737, roughly a century later. Of course, you always have to take such dates with a grain of salt, because any of them could be antedated, but it’s clear that even from the beginning, data’s grammatical number was in doubt. Some writers used it as a plural, some used it as a singular with the plural form “datas”, and apparently no one used its purported singular form “datum” for another hundred years.

It appears that historical English usage doesn’t help much in settling the matter, though it does make a few things clear. First, there has been considerable variation in the perceived number of data (mass, singular count, or plural count) for over 350 years. Second, the purported singular form, datum, was apparently absent from English for almost a hundred years and continues to be relatively rare today. In fact, in Mark Davies’ COCA, “data point” slightly outnumbers “datum”, and most of the occurrences of “datum” are not the traditional singular form of data but other specialized uses. This is the first strike against data as a plural; count nouns are supposed to have singular forms, though there are a handful of words known as pluralia tantum, which occur only in the plural. I’ll get to that later.

So data doesn’t really seem to have a singular form. At least you can still count data, right? Well, apparently not. Nearly all of the hits in COCA for “[mc*] data” (meaning a cardinal number followed by the word data) are for things like “two data sets” or “74 data points”. It seems that no one who uses data as a plural count noun ever bothers to count their data, or when they do, they revert to using “data” as a mass noun to modify a normal count noun like “points”. Strike two, and this is a big one. The Cambridge Grammar of the English Language gives use with cardinal numbers as the primary test of countability.
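The cardinal-number test can be roughly approximated on plain text. The sketch below is not COCA’s query syntax, which relies on part-of-speech tags; it is a crude regular-expression stand-in, and the number-word list, the HEAD_NOUNS set, the function name classify_hits, and the sample sentence are all assumptions made for illustration.

```python
import re

NUMBER_WORDS = "one|two|three|four|five|six|seven|eight|nine|ten"

# A cardinal (digits or a small number word), then "data",
# then optionally the word that follows it.
PATTERN = re.compile(
    rf"\b(\d+|{NUMBER_WORDS})\s+data\b(?:\s+(\w+))?",
    re.IGNORECASE,
)

# Hypothetical list of count nouns that "data" commonly modifies.
HEAD_NOUNS = {"set", "sets", "point", "points", "source", "sources"}

def classify_hits(text):
    """Split 'cardinal + data' matches into (modifier_hits, direct_hits)."""
    modifier_hits, direct_hits = [], []
    for match in PATTERN.finditer(text):
        next_word = (match.group(2) or "").lower()
        if next_word in HEAD_NOUNS:
            # "data" modifies an ordinary count noun; that noun is what gets counted.
            modifier_hits.append(match.group(0))
        else:
            # "data" itself appears to be counted.
            direct_hits.append(match.group(0))
    return modifier_hits, direct_hits

sample = "We compared two data sets and plotted 74 data points."
modifier_hits, direct_hits = classify_hits(sample)
print(modifier_hits)
print(direct_hits)
```

If data behaved like an ordinary plural count noun, the second list would fill up with hits like “3 data show”. A real study would use a tagged corpus rather than a word list, since this sketch misses larger number words and any head noun not in the list.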

Data does better when it comes to grammatical agreement, though this is not as positive as it may seem. It’s easy enough to find constructions like “as these few data show”, but it’s just as easy to find constructions like “there is very little data”. And when the word fails the first two tests, the results here seem suspect. Aren’t people simply forcing the word data to behave like a plural count noun? As this wonderfully thorough post by Norman Gray points out (seriously, read the whole thing), “People who scrupulously write ‘data’ as a plural are frequently confused when it comes to more complicated sentences”, writing things like “What is HEP data? The data themselves…”. The urge to treat data as a singular mass noun, because that’s how it behaves, is so strong that it takes real effort to make it seem otherwise.

It seems that if data really is a plural noun, it’s a rather defective one. As I mentioned earlier, it’s possible that it’s some sort of plurale tantum, but even this conclusion is unsatisfying. Many pluralia tantum in English are words that refer to things made of two halves, like scissors or tweezers, but there are others like news or clothes. You can’t talk about one new or one clothe (though clothes was originally the plural of cloth). You also usually can’t talk about numbers of such things without using an additional counting word or paraphrasing. Thus we have news items or articles of clothing.

Similarly, you can talk about data points or points of data, but at best this undermines the idea that data is an ordinary plural count noun. But language is full of exceptions, right? Maybe data is just especially exceptional. After all, as Robert Lane Greene said in this post, “We have a strong urge to just have language behave, but regular readers of this column know that, as the original Johnson knew, it just won’t.”

I must disagree. The only thing that makes data exceptional is that people have gone to such great lengths to try to get it to act like a plural, but it just isn’t working. Its irregularity is entirely artificial, and there’s no purpose for it except a misguided loyalty to the word’s Latin roots. I say it’s time to stop the act and just let the word behave—as a mass noun.

The Data Is In, pt. 1

Lately there has been a spate of blog posts on the question of whether data is a singular or a plural noun. Surprisingly, most of them come down on the side of saying that it can be singular—except when it’s plural. Although saying that it can be singular is refreshingly open-minded, I’ve still got a few problems with the facts and reasoning that led them to that conclusion, as well as the wishy-washiness of saying that it’s singular except when it isn’t.

The first post, “Is Data Is, or Is Data Ain’t, a Plural?”, came from the Wall Street Journal, and it took what Robert Lane Greene of the Economist blog Johnson called “an unusually fence-sitting position”: although they say that they “hereby join the majority” by accepting it as either singular or plural, they predict that “the plural will continue to dominate in our prose”. And they give this head-scratching reasoning:

Singular verbs now are often used to refer to collections of information: Little data is available to support the conclusions.

Otherwise, generally continue to use the plural: Data are still being collected.

Isn’t all data, whether you think of it as a count or a mass noun, “collections of information”? Just because something’s in a collection doesn’t mean it’s singular. For example, if I had an extensive rock collection, you probably wouldn’t say that I had a lot of rock, though I suppose you could; you’d probably say that I have a lot of rocks. The number really depends on the way we perceive the things in the collection, not on the fact that it’s in a collection. But if that wasn’t confusing enough, they give this unreliable test of data’s number:

As a singular/plural test, try to substitute statistics for data: It doesn’t work in the first case — little statistics is available — so the singular is fails to pass muster. The substitution does work in the second case — statistics are still being collected – so the plural are passes muster. (italics added for clarity)

Doesn’t this test simply tell you that data should always be plural? In what case would the singular is ever pass muster? Either I’m missing something important about how you’re supposed to use this substitution test or it’s simply broken.

Next came this post on the Guardian’s Datablog. Sadly, it’s even more muddled than the Wall Street Journal post, and it’s depressingly light on data. It simply asserts, without examination,

Strictly-speaking, data is a plural term. Ie, if we’re following the rules of grammar, we shouldn’t write “the data is” or “the data shows” but instead “the data are” or “the data show”.

But despite further assertions that data is “strictly a plural”, the Guardian style guide says, “Data takes a singular verb”, though they correctly note that (virtually) “no one ever uses ‘agendum’ or ‘datum’”. But this doesn’t make much sense; if it’s plural, why does it take a singular verb? And if it takes a singular verb, is it really plural?

The Guardian post also linked to this National Geographic post from a few years ago, which says much the same thing but somehow manages to be even more muddled. It starts off badly by saying that “data is often used as a collective noun referring to information, statistics, and the like”. Here they mean “mass noun”, not “collective noun”. Note that the Wikipedia articles each say at the top that these terms should not be confused. But aside from this basic mistake, note how it seems to contradict the Wall Street Journal post, which says that singular verbs are used for collections of information.

I wondered if this was just a simple error in the National Geographic post; from context, I would have expected the so-called “collective” form to use a singular verb. But in the next paragraph they say that their style is to use data as a plural when “referring to a body of facts, figures, and such.”

The post gets even more confusing, pointing out some of National Geographic’s supposed errors and then saying that both the singular and plural are considered standard. If they’re both standard, then how are their examples errors? The post ends with a red herring about avoiding confusion and the bizarre statement, “I’d rather not box writers into a singular form.” So why box them into a plural form? If there’s a distinction to be made, even a subtle one, between data as a mass noun and data as a singular noun, why not encourage it? Why whitewash over it by insisting that data always be plural?

Ultimately, though, this whole debate rests on one question: how do we know whether a word is plural or singular? And that’s what I’ll tackle next time.

Read part 2 here.

It’s just a joke. But no, seriously.

I know I just barely posted about the rhetoric of prescriptivism, but it’s still on my mind, especially after the recent post by David Bentley Hart and the responses by John E. McIntyre (here and here) and Robert Lane Greene. I know things are just settling down, but my intent here is not to throw more fuel on the fire, but to draw attention to what I believe is a problematic trend in the rhetoric of prescriptivism. Hart claims that his piece is just some light-hearted humor, but as McIntyre, Greene, and others have complained, it doesn’t really feel like humor.

That is, while it is clear that Hart doesn’t really believe that the acceptance of solecisms leads to the acceptance of cannibalism, it seems that he really does believe that solecisms are a serious problem. Indeed, Hart says, “Nothing less than the future of civilization itself is at issue—honestly—and I am merely doing my part to stave off the advent of an age of barbarism.” If it’s all a joke, as he says, then this statement is somewhat less than honest. And as at least one person says in the comments, Hart’s style is close to self-parody. (As an intellectual exercise, just try to imagine what a real parody would look like.) Perhaps I’m just being thick, but I can only see two reasons for such a style: first, it’s a genuine parody designed to show just how ridiculous the peevers are, or second, it’s a cover for genuine peeving.

I’ve seen this same phenomenon at work in the writings of Lynne Truss, Martha Brockenbrough, and others. They make some ridiculously over-the-top statements about the degenerate state of language today, they get called on it, and then they or their supporters put up the unassailable defense: It’s just a joke, see? Geez, lighten up! Also, you’re kind of a dimwit for not getting it.

That is, not only is it a perfect defense for real peeving, but it’s a booby-trap for anyone who dares to criticize the peever—by refusing to play the game, they put themselves firmly in the out group, while the peeve-fest typically continues unabated. But as Arnold Zwicky once noted, the “dead-serious advocacy of what [they take] to be the standard rules of English . . . makes the just-kidding defense of the enterprise ring hollow.” But I think it does more than just that: I think it undermines the credibility of prescriptivism in general. Joking or not, the rhetoric is polarizing and admits of no criticism. It reinforces the notion that “Discussion is not part of the agenda of the prescriptive grammarian.”[1] It makes me dislike prescriptivism in general, even though I actually agree with several of Hart’s points of usage.

As I said above, the point of this post was not to reignite a dying debate between Hart and his critics, but to draw attention to what I think is a serious problem surrounding the whole issue. In other words, I may not be worried about the state of the language, but I certainly am worried about the state of the language debate.

[1] James Milroy, “The Consequences of Standardisation in Descriptive Linguistics,” in Standard English: The Widening Debate, ed. Tony Bex and Richard J. Watts (New York: Routledge, 1999), 21.