August 7, 2012

The Data Is In, pt. 2

In the last post, I said that the debate over whether data is singular or plural is ultimately a question of how we know whether a word is singular or plural, or, more accurately, whether it is count or mass. To determine whether data is a count or a mass noun, we’ll need to answer a few questions. First—and this one may seem so obvious as to not need stating—does it have both singular and plural forms? Second, does it occur with cardinal numbers? Third, what kinds of grammatical agreement does it trigger?

Most attempts to settle the debate point to the etymology of the word, but this is an unreliable guide. Some words begin life as plurals but become reanalyzed as singulars or vice versa. For example, truce, bodice, and to some extent dice and pence were originally plural forms that have been made into singulars. As some of the posts I linked to last time pointed out, agenda was also a Latin plural, much like data, but it’s almost universally treated as a singular now, along with insignia, opera, and many others. On the flip side, cherries and peas were originally singular forms that were reanalyzed as plurals, giving rise to the new singular forms cherry and pea.

So obviously etymology alone cannot tell us what a word should mean or how it should work today, but then again, any attempt to say what a word ought to mean ultimately rests on one logical fallacy or another, because you can’t logically derive an ought from an is. Nevertheless, if you want to determine how a word really works, you need to look at real usage. Present usage matters most, but historical usage can also shed light on such problems.

Unfortunately for the “data is plural” crowd, both present and historical usage are far more complicated than most people realize. The earliest citation in the OED for either data or datum is from 1630, but it’s just a one-word quote, “Data.” The next citation is from 1645 for the plural count noun “datas” (!), followed by the more familiar “data” in 1646. The singular mass noun appeared in 1702, and the singular count noun “datum” didn’t appear until 1737, roughly a century later. Of course, you always have to take such dates with a grain of salt, because any of them could be antedated, but it’s clear that even from the beginning, data’s grammatical number was in doubt. Some writers used it as a plural, some used it as a singular with the plural form “datas”, and apparently no one used its purported singular form “datum” for another hundred years.

It appears that historical English usage doesn’t help much in settling the matter, though it does make a few things clear. First, there has been considerable variation in the perceived number of data (mass, singular count, or plural count) for over 350 years. Second, the purported singular form, datum, was apparently absent from English for almost a hundred years and continues to be relatively rare today. In fact, in Mark Davies’ COCA, “data point” slightly outnumbers “datum”, and most of the occurrences of “datum” are not the traditional singular form of data but other specialized uses. This is the first strike against data as a plural; count nouns are supposed to have singular forms, though there are a handful of words known as pluralia tantum, which occur only in the plural. I’ll get to that later.

So data doesn’t really seem to have a singular form. At least you can still count data, right? Well, apparently not. Nearly all of the hits in COCA for “[mc*] data” (meaning a cardinal number followed by the word data) are for things like “two data sets” or “74 data points”. It seems that no one who uses data as a plural count noun ever bothers to count their data, or when they do, they revert to using “data” as a mass noun to modify a normal count noun like “points”. Strike two, and this is a big one. The Cambridge Grammar of the English Language gives use with cardinal numbers as the primary test of countability.
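For anyone who wants to try this kind of search on their own text, here is a rough sketch in Python. It is not the COCA query itself, only a crude stand-in for “[mc*] data”: the function name `classify_hits`, the short list of number words, and the list of head nouns are all my own assumptions, chosen just for illustration.

```python
import re

# Assumed, deliberately tiny list of spelled-out cardinals; digits are matched separately.
NUMBER_WORDS = "one|two|three|four|five|six|seven|eight|nine|ten"

# A cardinal number, then "data", then (optionally) the next word.
PATTERN = re.compile(
    rf"\b(\d+|{NUMBER_WORDS})\s+data\b(?:\s+(\w+))?",
    re.IGNORECASE,
)

# Assumed list of ordinary count nouns that "data" commonly modifies.
HEAD_NOUNS = {"set", "sets", "point", "points", "source", "sources", "file", "files"}


def classify_hits(text):
    """Split 'NUMBER data' matches into modifier uses and bare counted uses.

    A hit is a modifier use when the word after "data" is in HEAD_NOUNS
    (e.g. "two data sets"); otherwise it is treated as a bare counted use
    (e.g. "3 data are").
    """
    modifier_hits, bare_hits = [], []
    for match in PATTERN.finditer(text):
        following = (match.group(2) or "").lower()
        if following in HEAD_NOUNS:
            modifier_hits.append(match.group(0))
        else:
            bare_hits.append(match.group(0))
    return modifier_hits, bare_hits


sample = "We collected two data sets and 74 data points. These 3 data are odd."
modifier_hits, bare_hits = classify_hits(sample)
print(modifier_hits)  # ['two data sets', '74 data points']
print(bare_hits)      # ['3 data are']
```

A real corpus query relies on part-of-speech tags rather than a hand-made word list, so treat the counts from something like this as a rough indication at best.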

Data does better when it comes to grammatical agreement, though this is not as positive as it may seem. It’s easy enough to find constructions like “as these few data show”, but it’s just as easy to find constructions like “there is very little data”. And when the word fails the first two tests, the results here seem suspect. Aren’t people simply forcing the word data to behave like a plural count noun? As this wonderfully thorough post by Norman Gray points out (seriously, read the whole thing), “People who scrupulously write ‘data’ as a plural are frequently confused when it comes to more complicated sentences”, writing things like “What is HEP data? The data themselves…”. The urge to treat data as a singular mass noun, because that’s how it behaves, is so strong that it takes real effort to make it seem otherwise.

It seems that if data really is a plural noun, it’s a rather defective one. As I mentioned earlier, it’s possible that it’s some sort of plurale tantum, but even this conclusion is unsatisfying.

Many pluralia tantum in English are words that refer to things made of two halves, like scissors or tweezers, but there are others like news or clothes. You can’t talk about one new or one clothe (though clothes was originally the plural of cloth). You also usually can’t talk about numbers of such things without using an additional counting word or paraphrasing. Thus we have news items or articles of clothing.

Similarly, you can talk about data points or points of data, but at best this undermines the idea that data is an ordinary plural count noun. But language is full of exceptions, right? Maybe data is just especially exceptional. After all, as Robert Lane Greene said in this post, “We have a strong urge to just have language behave, but regular readers of this column know that, as the original Johnson knew, it just won’t.”

I must disagree. The only thing that makes data exceptional is that people have gone to such great lengths to try to get it to act like a plural, but it just isn’t working. Its irregularity is entirely artificial, and there’s no purpose for it except a misguided loyalty to the word’s Latin roots. I say it’s time to stop the act and just let the word behave—as a mass noun.

Jonathon Owen


15 thoughts on “The Data Is In, pt. 2”

    Anyone else just say “datums”?

    One big reason why you get so many examples of “data are” in published literature is that many publishers’ style guides insist on it. So copyeditors like me, even if we agree wholeheartedly with you on the issue, still have to change “data is” to “data are” wherever we see it. I have written about this problem and urged publishers to get rid of this rule here:

    Check out Google ngrams for a comparison of “data is” and “data are”. Very strange changes in usage in the past 30 or so years.

    Bob: Ha! I was wondering how you’d respond. Of course, even if you use “data” as a singular mass noun, your editor or publisher may take the liberty of changing it for you.

    Anna: Exactly. Nothing’s really going to change until the people who write the style guides come around. And thanks for the link to your post. I quite enjoyed it.

    Richard: I noticed that while I was writing this but couldn’t find a good place to stick it in. Obviously “data are” is on the decline, and I would suspect that there’s been some pushback against “data is” in the last few decades.

    We use the plural in medical writing. I’d guess that’s where most of the examples come from.

    My Master’s thesis advisor insisted on “data are”. Consequently, I worked diligently to ensure that all sentences containing the word were written in such a way that I _never_ used “data is” (which I preferred) or “data are” (which I detested).

    It’s possible to write sentences in such a way that the word data is never the subject.

    @Jonathon: “datum” and “datums” are used in surveying. YES, and they do NOT have quite the same meaning as “data”.

    In surveying (and elsewhere) a “datum” is an arbitrarily-chosen reference point.

    Example: potential differences are measured in units called ‘volts’; and for convenience the potential of something handy is said to be ‘0 volts’. Your power is at 120V, mine is at 220V (both relative to the land under our feet).

    Problem: a transatlantic cable can sometimes be subjected to a potential difference of several kilovolts between ‘ground[US]’ at one end and ‘earth[GB]’ at the other end.

    You can give me data about that phenomenon, but you must make it clear which datum you are using, or else the whole thing will get very muddled.

    […] a post about the whole data conundrum at some point, but until that happens, check out parts 1 and 2 of The Data Is In at Arrant Pedantry. Personally, I prefer to consider data a collective noun and use a singular […]

    Very nice post. I’ve sat on the fence on this one for a long time in my academic writing, but I think I’ll now just bite the bullet and always make it singular.

    I want to make a synchronic observation about ‘peas’ though. Clearly it passes all three of your tests for countability. And yet, in their natural habitat (i.e. lots of them on your plate) peas are clearly semantically mass. This hit home to me when I was asked by a non-native speaker ‘how many peas do you want?’ This strikes me as highly infelicitous, since ‘how many’ calls for an answer in the form of some kind of numeral expression, which is clearly bizarre when talking about a food that we tend to eat so many of at one time that we can’t possibly pay attention, even approximately, to how many individual units we’re eating (just as with, e.g., rice). Of course ‘how much peas do you want?’ is totally ungrammatical, so it looks like there’s actually no simple way to express this question (though of course you could resort to circumlocutions like ‘how much do you want in the way of peas?’).

    By rights, you’d expect ‘peas’ to pattern with something like ‘rice’ in its mass/countability, and of course, originally, it did. Which makes it even more surprising that mass ‘pease’ was ever reanalysed as a plural.

    Interesting. I’ve always construed “insignia” as plural, though that may be because of the rarity with which I have encountered it.

    Chris: You’re right that there are some strange cases like peas that, while grammatically countable, don’t really seem to be countable in practice. Though my four-year-old did say he wanted “that much peas” just the other day.

    Boris: Insignia might not have been the best example. I think it’s most often used as a mass noun or as a plural count noun, though the form insignias does occur. But it’s a rare enough word that I’m honestly not sure how I’d treat it.

    Many things are countable in principle but not in practice. One can buy 2 grapefruits, but probably not 2 poppy seeds, or any other number. Poppy seeds are sold by weight, not by count.
