Five Tips for Avoiding Common Survey Data Analysis Errors

On December 18, 2018, an article in the Journal of the American Medical Association was retracted and replaced with a new version. The original article had been widely cited: by the Associated Press, The Washington Times, and NPR, among other outlets. CBS News ran the headline: “Fewer heavy Americans trying to lose weight.”

The original article reported that, while 56 percent of overweight or obese adults had reported trying to lose weight between 1988 and 1994, that percentage had dropped to 49 percent for those surveyed between 2009 and 2014. Pundits speculated on reasons for the reported decrease: perhaps there was greater acceptance of being overweight, or perhaps people had simply given up on trying to lose weight.

But the researchers had missed a change in the National Health and Nutrition Examination Survey. Before 1999, everyone was asked if he or she had tried to lose weight in the past 12 months. Starting in 1999, however, the survey asked respondents their current weight and their weight from a year ago, and, if the difference was 10 pounds or more, whether the weight loss was intentional. If they said yes, they were not asked the question “During the past 12 months, have you tried to lose weight?” — it was presumed that they had tried, since they had just answered that the weight loss was intentional. The statistic of 49 percent came from considering only the people who answered the have-you-tried-to-lose-weight question, not counting those who said they had intentionally lost 10 pounds or more.

In the corrected article, with both questions considered, it turned out the percentage of overweight and obese adults trying to lose weight in the 2009-2014 data was 58 percent, slightly higher than in the earlier time period. The authors corrected the statistics as soon as the error was pointed out to them, and some news organizations reported on the correction. But not everyone noticed the correction, and the original incorrect statistic has lived on.

Analyzing survey data can be challenging, and there is no foolproof way of preventing every possible error. But some errors can be prevented or caught before publication with the following tips.

1. Read the questionnaire and investigate missing data. Many questionnaires have “skip patterns,” in which only some of the respondents are asked a particular question. For those questions, the frequency table of responses will have many missing values — more than for most of the other questions. (The sketch at the end of this tip shows one way to check.)

Of course, even questions asked of everyone can have missing data. For example, about 22 percent of households participating in the National Crime Victimization Survey do not supply information about household income, so conclusions about income depend on how the missing values are treated. 

Also check how missing values are coded in the data. I once saw a student report a very high average age for crime victims because he had forgotten that the data set used the value “99” for persons who refused to give their ages; his analysis treated all the people with missing ages as if they were 99 years old.
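Here is a minimal pandas sketch of both checks: tabulating every response, including missing values, and recoding a sentinel value before computing statistics. The column names and data are invented for illustration.

    import numpy as np
    import pandas as pd

    # Hypothetical extract from a survey file; names and values are invented.
    df = pd.DataFrame({
        "victim_age": [34, 27, 99, 61, 99, 45],      # 99 = refused to answer
        "hh_income":  [3, np.nan, 5, np.nan, 2, 4],  # NaN = skipped or missing
    })

    # Tabulate every value, including missing ones, before computing anything.
    print(df["hh_income"].value_counts(dropna=False))

    # A naive mean treats the "99" refusals as real ages and is inflated.
    print(df["victim_age"].mean())

    # Recode the sentinel value to missing, then recompute.
    df["victim_age"] = df["victim_age"].replace(99, np.nan)
    print(df["victim_age"].mean())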

2. Use the correct weights and check their sums. Survey weights tell you how many people in the population are represented by a person in the data set. For most analyses that are intended to apply to the population, you need to use the weights (there are exceptions, but not many). Sometimes this is complicated, because a survey may have several different sets of weights.

And you can use the weights to check your analysis. The sum of the survey weights is approximately the size of the population. So for surveys of the US population, the weights should sum to approximately 330 million, the sum of the weights for persons age 85 and older should be approximately 6 million, and similarly for any group you study.
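A quick sanity check is to sum the weights overall and within subgroups. This sketch uses a hypothetical person-level file; the column names and weight values are invented.

    import pandas as pd

    # Hypothetical person-level file; "age" and "wt" are invented names,
    # and the weights are toy values chosen to sum near the US population.
    df = pd.DataFrame({
        "age": [34, 27, 88, 61, 90, 45],
        "wt":  [80e6, 78e6, 3.1e6, 82e6, 3.0e6, 84e6],
    })

    # The weights should sum to roughly the population size (~330 million).
    print(f"All persons: {df['wt'].sum():,.0f}")

    # Subgroup check: weights for persons 85 and older should be near 6 million.
    print(f"Age 85+:     {df.loc[df['age'] >= 85, 'wt'].sum():,.0f}")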

3. Check your estimates against those published by the data producer and/or other researchers. They might not have published exactly the estimate you need, but you can still verify that your code reproduces a statistic they did publish, even one for a different question. (A sketch of such a check follows this tip.)

Be especially vigilant about findings that confirm a desired outcome, since it’s easy to be less skeptical if the findings support your view.
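As a concrete check, you can compare a weighted estimate from your code with a figure the producer published. Everything in this sketch is hypothetical, including the “published” value.

    import pandas as pd

    # Hypothetical respondent file with a 0/1 response and survey weights.
    df = pd.DataFrame({
        "tried_to_lose_weight": [1, 0, 1, 1, 0, 1],
        "wt": [80e6, 78e6, 3.1e6, 82e6, 3.0e6, 84e6],
    })

    # Weighted proportion; an unweighted mean would not describe the population.
    est = (df["tried_to_lose_weight"] * df["wt"]).sum() / df["wt"].sum()

    published = 0.58  # invented figure standing in for the producer's estimate
    print(f"our estimate: {est:.3f}   published: {published:.3f}")
    if abs(est - published) > 0.01:
        print("Discrepancy: recheck weights, filters, and missing-data handling.")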

4. Construct graphs of the data. Persons analyzing survey data in the 1970s typically did not graph their data. This was partly because the data sets were large, and partly because survey data have features, such as unequal weights and clustering, that standard graphing commands do not display. But there are now techniques and software that will construct histograms, scatterplots, and other graphs from surveys. The student with the high average age of crime victims would have caught his mistake if he had constructed a histogram showing the spike at “age” 99.
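Here is a minimal sketch of a weighted histogram with matplotlib; the ages and weights are simulated, with a block of “99” sentinel values mixed in to mimic the student's mistake.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)

    # Simulated reported ages, with 40 "refused" responses coded as 99.
    ages = np.append(rng.integers(12, 80, size=500), [99] * 40)
    weights = rng.uniform(500, 5000, size=ages.size)  # invented survey weights

    # Weighting the histogram shows estimated population counts per age bin;
    # the spike at "age" 99 stands out immediately.
    plt.hist(ages, bins=range(10, 105, 5), weights=weights)
    plt.xlabel("Reported age")
    plt.ylabel("Estimated number of persons")
    plt.show()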

5. RTFM.

Once upon a time, every student of statistical computing pored over the LINPACK manual, which contained elegant Fortran subroutines developed in the 1970s to solve systems of linear equations. I spent so much time with the manual as a student that I could recite the code for many of the subroutines from memory.

The Table of Contents page of the LINPACK User’s Guide contained the quote “R.T.F.M.,” attributed to Anonymous. The boys in the class all seemed to know what the initials stood for, but when they wouldn’t share the information with me I finally gathered my courage and asked the professor. He hesitated a few seconds, then mumbled, “Well, RTM stands for Read the Manual.”

So read the friendly manual that explains the data set, survey design, weighting, and special considerations needed for analysis. All the way through. I know; you want to get your hands on the data and start making discoveries. But remember what happened to the last piece of assemble-it-yourself furniture that you put together before reading the instructions.

Copyright (c) 2019 Sharon L. Lohr