This is something I wrote over four years ago, and is not exactly related to programming or computer science. It is instead about statistics and to some degree logic, and it is a piece of writing I still believe to have value. So, unedited and without further ado:
A trend in almost all online discussion of statistical studies is to point out the so-called "correlation-causation fallacy" - that is, "correlation does not equal causation."
This is of course true, and is well worth pointing out in some situations. I would estimate that the correlation-causation fallacy is likely second only to ad hominem among the fallacies commonly found in public dialogue. Closely related is the concern about which direction the causation runs, but I will save that issue for another time.
For those who are unaware, the correlation-causation fallacy in a nutshell is any sort of argument that goes along the lines of "I observe A happening at the same time as B, therefore A causes B." Stated in plain logical terms it is clear why it is fallacious, but when dressed up in suitable rhetoric - "Those kids are always playing violent videogames and listening to bad music and etc... and they're also doing bad in school, so videogames and bad music and etc. must make you bad at school" - it becomes a very tempting (though still quite wrong) argument indeed. The main danger is that both A and B can be explained by some external factor C, say in the case of the previous example, inattentive parents.
However, this criticism is often leveled against statistical studies. Again, this is not entirely without merit, especially if one is critiquing the specific headline or way a study was framed by the media (which is often inaccurate and overly generalized). However, to use the "correlation-causation fallacy" as a rhetorical cudgel with which to dismiss any and all statistical findings (or at least those you don't like) is a fallacy in and of itself, hence this writeup.
Those who overuse the "correlation is not causation" line often have little understanding of how a proper statistical study is actually conducted. For an example, see this discussion on Slashdot. It's about a recently published study which generally concludes that a few drinks a day is healthy, or at least not too unhealthy. Here's one comment that was highly moderated (i.e. generally approved by the community, which in the case of Slashdot consists of a reasonably intelligent mix of mostly male geeks):
The Old Correlation-Causation Confusion
Well, that would be *excellent*, I love a glass of wine or three a day. A beer or two on a hot day is just heavenly. But unfortunately the correlation may not imply causation. i.e. people who live longer drink more, but not vice-versa.
Maybe really sick people don't drink as much.
Maybe the people that have four drinks a day have to be quite healthy to keep that up day after day after day.
Maybe drinking keeps them off the streets, or out of other dangerous places.
Maybe all the 4-drink-a-day people have died already and were not around for a survey.
Lotsa possible ways to spoil things.
Another highly moderated comment:
Stats 101...
Correlation does not imply causation. All we can say is that "people who drink a bit of alcohol tend to live longer," not that alcohol prolongs their lives. It could be that these individuals take the time to socialize and de-stress, which causes them to live longer. Or perhaps there are financial factors at play: someone who can afford to drink three or four bottles of wine a week is not likely to be living in abject poverty. Of course, it could also be that anti-oxidant properties of the beverages have a positive effect as well.
It is worth noting that there was actually a reasonably insightful reply to the above comment, and I will essentially expand upon what it said here. Both of the above comments, despite their erudition in using the scholarly-sounding terms "correlation" and "causation", are actually a display of general statistical ignorance. Upon examining the news report about the study, it becomes clear that this is not the sort of result that can be so casually dismissed. A key excerpt:
Their conclusion is based on pooled data from 34 large studies involving more than 1 million people and 94,000 deaths.
This was a very large study, and its scope suggests to me that those who conducted it are likely well aware of the issues of correlation and causation and that the former does not necessarily imply the latter. In fact, typical statistical methods (including the ones likely used in this study) are built explicitly to help control for these issues. Newsmen and pundits may commit the correlation-causation fallacy, but someone who has spent years studying regression analysis is unlikely to. This is not to say that all academic statistical work is flawless - in fact, the more of it I see, the more flaws I see. However, the mistakes are often much like the work itself - very complex. One generally cannot dismiss an academic study with one sentence and a few logical fallacy terms (there are some situations where you can, but I don't think this particular study is one of them).
Don't worry though, I'm not going to just wave my hands here and expect you to believe me. Here is roughly how statistical studies control for the issue of correlation versus causation, among other things: first, it all comes down to your data, and your data depends on your sampling technique. Here they used a pooled sample, combining the results from 34 previous studies into a quite tremendous one-million-person sample. While we do not know how the individual studies were conducted, it is safe to say that with such a large total sample it should be possible for competent researchers to build a sample generally representative of the total population. That is, given that the world demographics are known (roughly a 50/50 gender split, a generally bell-shaped distribution for age with few very young people, few very old people, and most somewhere in the middle, etc.), the researchers can weight and select data based on these characteristics so that the results map back onto reality more faithfully.
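To make that concrete, here is a minimal sketch of one standard way to do this, post-stratification weighting. To be clear, I don't know which adjustments this study's authors actually used; every stratum and number below is invented purely for illustration.

```python
# Toy illustration of post-stratification: reweight a pooled sample so its
# demographic mix matches the known population mix. Every number is invented.

# Strata: (gender, age group) -> known population share (shares sum to 1).
population_share = {
    ("female", "18-39"): 0.18, ("female", "40-64"): 0.22, ("female", "65+"): 0.10,
    ("male",   "18-39"): 0.18, ("male",   "40-64"): 0.22, ("male",   "65+"): 0.10,
}

# Hypothetical pooled sample: (number sampled, mean outcome) per stratum.
# Older people are over-represented here relative to the population above.
sample = {
    ("female", "18-39"): (1200, 34.1), ("female", "40-64"): (2600, 22.5),
    ("female", "65+"):   (1900, 11.8), ("male",   "18-39"): (1100, 31.9),
    ("male",   "40-64"): (3400, 20.7), ("male",   "65+"):   (2800, 10.2),
}

n_total = sum(count for count, _ in sample.values())

# Unweighted estimate: averages over whoever happened to end up in the sample.
raw_mean = sum(count * mean for count, mean in sample.values()) / n_total

# Post-stratified estimate: each stratum counts according to its population
# share rather than its sample share, correcting the over-representation.
adjusted_mean = sum(population_share[s] * mean for s, (_, mean) in sample.items())

print(f"unweighted estimate:      {raw_mean:.1f}")
print(f"post-stratified estimate: {adjusted_mean:.1f}")
```

The idea is simply that if one group is over-represented in the pooled sample, its outcomes should count for proportionally less when projecting back onto the whole population, and vice versa for under-represented groups.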
Of course, the researchers should pick and choose based on these factors alone, and not on other ones (for a health-related study, it would bias your results a great deal to choose a sample based on preexisting health conditions, e.g. to study exclusively healthy or exclusively sick people). And this brings me to another technique of sampling - random sampling. If you cannot collect a sample large enough to construct a balanced one, then you can simply choose people at random, letting all other factors balance out on average. If done properly, such a study can have pretty good statistical power with a sample as small as a thousand people. True random sampling is increasingly difficult in the modern day, though - the sorts of people who will respond to surveys and studies are different from those who won't, and that alone will bias your sample.
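As a rough sanity check on that "a thousand people" claim, the textbook normal-approximation formula for the margin of error of a proportion shows how quickly a genuinely random sample tightens up. This is only a back-of-the-envelope sketch under the idealized simple-random-sampling assumption:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion under simple random sampling
    (normal approximation); p = 0.5 is the worst case."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000, 1_000_000):
    print(f"n = {n:>9,}: +/- {100 * margin_of_error(n):.1f} percentage points")
```

Around a thousand respondents already pins a proportion down to within about three percentage points either way; the hard part, as noted above, is making the sample genuinely random in the first place.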
Now, why does having a balanced sample (constructed or random) help with the correlation-causation fallacy? In the words of Sherlock Holmes, "when you have eliminated the impossible, whatever remains, however improbable, must be the truth." That is to say, if your study is adequately controlled for possible external "C" factors (as discussed earlier), then it is reasonable to conclude that the relationship between A and B is causal (though, as noted above, the direction of the cause is another issue).
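A toy simulation (mine, not the study's) makes the point. Below, an external factor C drives both A and B, so A and B look strongly correlated overall; but once you compare only observations with roughly the same C - which is, in effect, what a well-controlled study does - the correlation all but vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical external factor C drives both A and B; A has no effect on B.
C = rng.normal(size=n)                       # e.g. "parental inattention"
A = 0.8 * C + rng.normal(scale=0.6, size=n)  # e.g. hours of videogames
B = 0.8 * C + rng.normal(scale=0.6, size=n)  # e.g. poor school performance

print(f"overall corr(A, B): {np.corrcoef(A, B)[0, 1]:.2f}")

# Condition on C: look at A vs B only within narrow bands of C,
# which is what a well-controlled comparison effectively does.
within = []
for lo in np.arange(-2.0, 2.0, 0.25):
    mask = (C >= lo) & (C < lo + 0.25)
    if mask.sum() > 200:
        within.append(np.corrcoef(A[mask], B[mask])[0, 1])

print(f"corr(A, B) holding C roughly fixed: {np.mean(within):.2f}")
```

Conversely, if a strong relationship between A and B survives after conditioning on every plausible C you can measure, a causal reading becomes much more defensible.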
In the case of a medical study, that means controlling for characteristics that would be pertinent in terms of health. If you're studying the effects of alcohol, you don't want to survey just healthy people or just sick people, but rather a suitable mix of both. In the field of political science, controls have more to do with, well, political, social, and economic issues. If you want to argue, in international relations, that democracies do not go to war with each other (the "democratic peace", common in both academic papers and presidential speeches), then you would do well to control for GDP (to defend against the argument that democracies simply happen to be wealthy and it is wealthy countries that avoid mutual war, on account of the prohibitive cost of suspending trade and disrupting industry). Of course the argument gets more complex if somebody asserts that democracy causes wealth and then wealth causes peace - while this may somewhat save the "democratic peace", the causal chain must be defended from possible alternative explanations at each link.
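Here is a hypothetical sketch of what that GDP control does in practice, using invented country-pair data and a simple linear probability model (a common shortcut in political science). In this simulated world, wealth alone reduces conflict and democracy has no direct effect; a naive regression still "finds" a democratic peace, and adding the GDP control exposes it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000  # invented country-pair observations

# Simulated world: wealth alone reduces conflict; democracy is merely
# correlated with wealth and has no direct effect on war.
log_gdp = rng.normal(size=n)
democracy = (log_gdp + rng.normal(scale=1.0, size=n) > 0).astype(float)
p_war = 1 / (1 + np.exp(2.0 + 1.5 * log_gdp))  # richer pairs fight less
war = (rng.random(n) < p_war).astype(float)

def lpm(y, *covariates):
    """Linear probability model: least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones_like(y), *covariates])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = lpm(war, democracy)
controlled = lpm(war, democracy, log_gdp)

print(f"democracy coefficient, no controls:         {naive[1]:+.3f}")
print(f"democracy coefficient, controlling for GDP: {controlled[1]:+.3f}")
```

If, with real data, the democracy coefficient survived the GDP control, the democratic-peace claim would have withstood this particular alternative explanation; in my toy data it does not, by construction.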
Some issues are so slippery, with so many possible causes, that it is very difficult to get statistical traction on them. There are sophisticated methods to help with this, most of which I only have a vague understanding of at the moment (check back in a few years). But yes, some assertions are beyond reasonable testing, particularly when you cannot control the behavior of your objects of study (that is, you are not in an experimental setting such as a laboratory but are rather trying to observe real-world issues such as war). A currently hot field in political science is to try to use a more experimental approach, and this is somewhat possible in domestic politics or public opinion issues where you can take steps to affect your object of study. In the case of international relations, though, it is unlikely that academics will ever be able to tell countries to go to war or not simply in the name of science.
And so, the bottom line is that it is still quite reasonable to be suspicious of statistics, especially when they are being cited by the media and/or politicians. Even when a study is valid, the results are often twisted by an intern who just read the abstract and decided it would make good political fodder in a campaign ad. But just as correlation does not imply causation, suspicion should not entail dismissal - be cautious, but still give studies a decent thinking-through at the least before concluding they are either right or wrong. Ask yourself these questions:
- Did they build their sample in a reasonably unbiased way?
- Is there a clear mechanism to explain why their independent variable(s) leads to their dependent variable?
- Have they accounted for any superior alternative explanations that I can come up with?
If the answer to all three of these is "yes", then it is reasonable to say that the study is accurate. If the answers are more mixed, well, then deal with them as the situation justifies.
Thank you for reading.
My main reason for posting this is posterity - the blog I originally posted this on is no longer online. I also find parts of it (most notably the bits with Slashdot) amusingly anachronistic.
But all in all, I do like the fact that what I wrote still seems sensible to me four years later. It's also still relevant - as was joked in last week's entry, programmers often pretend they are good at math (for example by exclaiming "correlation does not equal causation!"), but in reality that is not their main strength.
Causality is always difficult to tease out in any situation, statistical or not. A rather strong philosophical case can be made that causality is always uncertain at best (though practically speaking one can usually be pretty sure). But the whole point of regression analysis is to control for these spurious variables and get at the underlying mechanism - it's not foolproof, but it's not foolhardy either.