Books per Year: More Damn Lies and Statistics, Part 1
Posted by Joseph Moore on January 10, 2013
Darwin Catholic’s nice effort to list 5 books they plan to read this year generated a comment that reading 20 books a year is a lot. Which got me to thinking about charts and statistics:
The following charts and stats are just samples pulled with a few minutes of Googling around, and are mostly people I’ve never heard of before, so my criticisms and comments here are narrow, and not meant to condemn or condone what the various sites are doing. Just a little educational exercise.
This chart shows how many books people say they read per year over the last 3 years:
In this lovely chart from Statista (whoever they are), the two key words are ‘survey’ and ‘interactive’. What might they mean? Maybe I don’t know where to look, but I didn’t find much help on the Statista site beyond this statement:
Statista aggregates statistical data on over 600 international industries from more than 18,000 sources, including market researchers, trade organizations, scientific journals, and government databases.
This is an interesting case where what the English is intended to say is almost the opposite of what it actually means. What it seems to say is: Look! We go around and find the best statistical analysis from all over! It’s wonderful, high-quality stuff, here! What it actually means is: wow, since we don’t tell you much of anything about how our sources go about compiling these numbers, and the method used no doubt changes from source to source and even from chart to chart, no mere human being is going to be able to figure out if any of our numbers mean anything at all without doing a good bit of research on his own.
Wait! If I scroll down a bit, there’s a link to the source – after you sign up for free. I’ll make the sacrifice – nothing I won’t do for both my readers. So, let’s do that good bit of research.
The link brings us to Harris Interactive, where the following marketing blurb is available under a tab to ‘Methods & Tools’:
Methods & Tools
With business and consumer research conducted in more than 200 countries, we combine the benefits of industry specialization and research expertise to deliver powerful insights.
- Advanced Analytics: Employing one of the industry’s strongest groups of experts in advanced quantitative methods, we bring sophisticated research methodologies to your study.
- Full Suite of Services: Includes online and traditional research. Qualitative and quantitative studies from design through analysis. We provide you with the personal service of a small firm coupled with the resources of a large one.
- Category Depth – Global Reach: Our researchers have access to what we believe to be one of the highest quality online research panels in the world. Harris Poll OnlineSM Panel respondents span a diverse range – from consumers to business professionals. Within that breadth lies the depth of our specialty panels, which target precise populations.
- Specialty Panels: A specialty panel is a subset of Harris Poll Online members who share certain interests or characteristics and who are managed internally as a community. We can develop a panel specific to your project, according to the exact characteristics of your targets, from income to product use to race to zip code – allowing for intricately specified targeting.
That’s remarkably unhelpful, if 100% buzz-word compliant. How about info on the actual chart? Hunting around some more, we find – nothing, apart from this note under the original chart:
Further information
Respondents for this survey were selected from those who have agreed to participate in Harris Interactive surveys. The data have been weighted to reflect the composition of the adult population.
If there’s anything else, I couldn’t find it in 15 minutes of poking around. So I sent the following email to the designated contact person:
I signed up for the free version of your site just now, and for kicks decided to try to track down some details on the survey methods used on the Number of Books read by Americans chart. I’m interested in how survey participants were selected to ensure the sample was random, or what ‘weighting’ approach was used to approximate such a sample. Nothing too technical, just your basic sample selection methodology. Also, ‘interactive’ means web-based? Or what?
I’ll report back if I get a response.
Why am I worried about this? Why can’t I just accept this harmless little survey at face value? Two big things and a bunch of small things:
Big #1 – Self-reporting. All this survey (and every other survey) does is ask people questions and analyze their answers. You don’t have to be as cynical as I am to imagine that people don’t always give accurate answers, for three reasons:
- they don’t know the answer;
- they want to give the answer they think the surveyor wants to hear;
- they give the answer they think makes them look better.
In this case, it’s possible some respondents think they might look stupid if they admit to never or rarely cracking a book, and so they fudge a little (or a lot) upwards. Who knows? But a survey not followed up by some real objective research is always, always, always suspect, and often just stupid.(1) Here’s the general problem explained.
Big #2. “Interactive”. Not certain what this means, but it seems to mean ‘over the web’. Why is this an issue?
- are the kind of people who take surveys over the web a representative sample of people in general? How do you know?
- can forms on the web be gamed? Can you figure out a way to answer the same survey more than once?
- do you ever just kind of make stuff up when filling out web surveys? Not that *I* would ever do that! What sort of jackanapes would dare allow such an outrageous insult past the barrier of his teeth!?
Of these, the first is the major concern. In this example, could both the high and the low ends of readership correlate strongly with not using the web much? Could people who don’t read at all and people who can’t stop reading both be overrepresented among people who don’t use the web? I don’t know, but that’s the sort of thing you’d need to know before throwing up a bunch of colorful charts and numbers.
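To show how much that could matter, here is a minimal sketch in Python. Every number in it is made up for illustration: the population shares, the chance of landing in a web panel, and the names population_share, web_panel_rate and panel_share are all my assumptions, not anything Harris Interactive or Statista published.

```python
# Hypothetical shares of the adult population by books read per year.
population_share = {"0": 0.15, "1-5": 0.35, "6-20": 0.35, "21+": 0.15}

# Hypothetical chance that a person in each group ends up in a web panel.
# Assumption: non-readers and heavy readers are both online less often.
web_panel_rate = {"0": 0.02, "1-5": 0.06, "6-20": 0.06, "21+": 0.02}


def panel_share(population_share, web_panel_rate):
    """Return the share of each group among the people who join the panel."""
    weights = {
        group: population_share[group] * web_panel_rate[group]
        for group in population_share
    }
    total = sum(weights.values())
    return {group: weight / total for group, weight in weights.items()}


sample_share = panel_share(population_share, web_panel_rate)

for group in population_share:
    print(
        f"{group:>5}: population {population_share[group]:.1%}, "
        f"panel {sample_share[group]:.1%}"
    )
```

With these invented rates, each extreme group drops from 15% of the population to about 6% of the panel. Weighting the panel by age, sex or region would not repair that unless reading habits happened to line up with those categories.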
As for the smaller problems – maybe we should just call them questions – they are many. Here are a couple:
- what does ‘read a book’ mean? Cover to cover? A couple pages from the middle? The back dust cover? Does having read ‘Goodnight Moon’ and ‘The Sneetches’ a million times count as 2 books, a million books, or no books?
- Even if the surveyor is clear about what they mean by ‘read a book’, are the survey takers? Even if they are trying to be honest?
- the variations from year to year seem extreme. Over a population including students, moms and dads, retired people, poor people, rich people, and so on – is it really reasonable to expect the huge swings we see in a couple of places? What would cause such a thing? I’d want that investigated rather than just reported – when you just report it, it tends to make people wonder more about your methods than the reported change.
- n = 2,056. OK, that may be a perfectly wonderful number, but how would I know? Mike Flynn might know – the statisticians I work with seem to be comfortable with some pretty small sample sizes as often as not.
- other surveys. I’ve watched with bemused, slight interest surveys such as this over the years, and the numbers are wildly variable. Does the Average American adult read 5 books a year? 9? 15? 17? Google around, and those numbers pop up. I tend to think these numbers are way overstated, that people claim to read a lot more books than they actually do – but I’m a skeptic.
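For what it’s worth, the textbook arithmetic behind a number like n = 2,056 can be sketched in a few lines of Python. This assumes a simple random sample and the usual normal approximation at 95% confidence (z = 1.96), which is exactly the assumption in doubt here. The function names are mine.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval
    for a proportion p estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest simple random sample giving at most the requested margin."""
    return math.ceil((z / margin) ** 2 * p * (1 - p))


moe = margin_of_error(2056)
print(f"n = 2056 gives a margin of about +/-{moe:.1%}")
print(f"+/-2.5% needs n of at least {required_sample_size(0.025)}")
```

Under those assumptions the margin comes out a little over two percentage points. That only describes sampling noise. It says nothing about who agreed to be sampled or whether they answered accurately.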
Next, when I get a minute, I’ll google around for some other info, and we’ll do it all again! Weeee!
1. Via Wikipedia: Aronson, Wilson, Akert 2010, Social Psychology, p. 150, ISBN 9780138144579: “Most of us have moderate to high self-esteem. Like the mythical residents of Garrison Keillor’s Lake Wobegon, we need to believe that we are above average. For example, in a survey of a million high school students, only 2 percent stated that they were below average in their leadership ability (Gilovich 1991)”
This entry was posted on January 10, 2013 at 7:33 pm and is filed under Authors, Modern Cluelessness, Science!, Thoughts.

Mike Flynn said
A sample size of 2056 would be sufficient for a precision of ±2.5% if the parameter value is p=50% (a conservative choice) provided:
a) the population is a population; that is, has a single value for p for all strata.
b) the sample is a simple random sample.
There is nothing less common than an actually random sample. (Random chance requires intelligent design.)
The biggest problems are never precision of the sample, but bias in the sample. This need not be intentional. In stats, “bias” is anything that knocks the data off the straight and narrow. For example, the way a question is worded, the procedure for handling non-response, etc.
Ishmael Alighieri said
Thanks. I thought it probably was, as I mentioned – it was more just throwing out that one number, as if it means anything without the other stuff you mention. Maybe I’m way off, but I seriously doubt that a couple of thousand people who are willing to do on-line Harris polls could be just assumed to be representative of the American population. There’s no relationship between those who never read a book and those who never fill out on-line surveys?
Mike Flynn said
My suspicion would be that the more books one reads, the less time spent on-line.
A couple thousand people could be a representative sample if it is a random sample; but collecting a random sample under such conditions is heroically impossible. It’s less how many units were sampled than how the units were sampled. A simple, if out-of-print, book is “A Sampler on Sampling” by Bill Williams.
Ishmael Alighieri said
Thanks. Sorely tempted to learn something about statistics, if only to prove my brain has not completely ossified. But I also need to read and reread a bunch of Aristotle and Thomas! There are simply more good things to know than there is time to learn them.
My uneducated suspicion is that getting a representative sample isn’t even the biggest problem here – it’s lack of clear definitions and the perils of self-reporting data that reflects on the person’s intelligence – that’s not a recipe for getting meaningful results.