Sunday, November 28, 2010

Observing causality and avoiding bias

Two weeks ago I posted about the correlation-causation fallacy; today my focus is on how much we can hope to learn about causality through observation. For more academically grounded background reading I suggest this essay by Andrew Gelman - here I will attempt to convey similar thoughts with fewer citations.

Controlled experiments are the "official" way to determine causality, but there are many interesting questions in this world that cannot be treated in a controlled environment. Statistical hypothesis testing is essentially an attempt to use observational data to create a "quasi-experiment", where we have cases and treatments and controls and, perhaps, causality. But a common claim is that, no matter what your computer says after it inverts those really big matrices, you need some underlying theory (i.e. a reason or explanation) for what is going on before you can talk about it causally.

Unfortunately, in reality this leads to people fitting data to models rather than models to data. They develop their "theoretically informed" viewpoint, then go around looking for data to validate it. This is even worse outside of academia, where the viewpoint may not be theoretically informed but rather "business" informed.

But I agree that numbers without theory mean very little - even if you believe the causal relationship solely based on the numerical results, you still need some sort of viewpoint to have a meaningful interpretation (and then, presumably, suggestions for action based on it, depending on your situation). So, I see the problem not just as teasing out causality but also as avoiding the bias introduced by our theoretical musings while still adding something to the study beyond number crunching.

There are many statistical techniques that can be employed, but the key to causality is in the "higher level" design and data used. As Gelman puts it:
The most compelling causal studies have (i) a simple structure that you can see through to the data and the phenomenon under study, (ii) no obvious plausible source of major bias, (iii) serious efforts to detect plausible biases, efforts that have come to naught, and (iv) insensitivity to small and moderate biases (see, for example, Greenland, 2005).
In other words, your analysis (data and theory) should be no more complicated than absolutely necessary and you should take pains to be open minded and genuinely consider alternatives. Though not statistical, this relates to the argument in my previous entry about education - words like "obviously" and "clearly" have no place in serious writing. Obviously.

Both simplicity and open mindedness are rather difficult goals for analysis. Appearing complex can be key to "selling" an argument, be it a paper you are trying to publish or a product you want your company to make. People are naturally intimidated by excessive precision and other telltale signs of statistical amateurism - 45% doesn't sound anywhere near as accurate as 44.72%, even if your confidence interval dwarfs the rounding. The main key to overcoming this is a bit of actually learning statistics and a lot of willingness to call people on nonsense and to not pander when selling things yourself.
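To make the precision point concrete, here is a minimal sketch (my own, with made-up survey numbers) of a 95% confidence interval for a reported proportion - when the interval spans several percentage points, the second decimal place is pure decoration.

    # Sketch with hypothetical numbers: 447 "yes" answers out of 1,000 respondents.
    import math

    successes, n = 447, 1000
    p_hat = successes / n

    # Normal-approximation 95% confidence interval for a proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = 1.96 * se

    print(f"point estimate: {p_hat:.2%}")                            # 44.70%
    print(f"95% CI: {p_hat - margin:.1%} to {p_hat + margin:.1%}")   # ~41.6% to ~47.8%
    # The interval is roughly +/- 3 percentage points wide, so reporting
    # "44.72%" implies an accuracy the data cannot support.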

And open mindedness is difficult in any realm - after all, we spend all our time inside our own heads. I believe that taking philosophy/logic classes (or at least having real philosophical/logical discussions) and actively trying to play "devil's advocate" is a fantastic method to recondition our natural urge to delegitimize and dismiss opposing views. Even a perspective that you feel is horribly flawed is likely at least internally cohesive, if you're willing to tweak a few axioms.

So there you have it, two simple but difficult guidelines on the quest for causality. Of course, there are many more specific bugaboos to be concerned with. In my view, causal direction is the next largest issue (after those addressed in the two causality entries I've written already). I may at some point elaborate on it, but in a nutshell it's always good to remember that even the most airtight statistical study doesn't really tell you which way your arrow is going - and in reality, it's almost certainly pointing both ways, and outward to further factors as well. The best approach is to make an argument for the "strongest", but not only, causal relationship.

For now, whether you produce or consume statistics (and I assure you that at least the latter is true), keep a simple and open minded approach. Thanks for reading!

Sunday, November 21, 2010

Learning and open source

"Dammit Fry, I can't teach - I'm a professor!" - Professor Farnsworth, Futurama
Education is an industry - or rather, education is a byproduct and imitation of industry. Educational institutions serve as culling mechanisms in a system with competition, much as money serves as proxy for power in a system with scarcity. But what of learning?

Learning is, at best, moderately correlated with education. By coercing students to repeatedly immerse themselves in material, the hope is that some genuine absorption and perhaps even understanding occurs. But the primary concern, for both students and schools, is grading and ranking - supposedly measurable results that then allow employers to better choose amongst their applicants.

Before I continue, I want to caveat my cynicism - schools are full of good people, many of whom are genuinely trying to learn and improve society. My concerns lie chiefly with the force of extrinsic motivation (e.g. grades, career) on those without power (students, most teachers) and with the priorities of those with power. Schools are in fact still in a better state than many other institutions - but my expectations for schools are higher, as I believe education to be the key to a successful society.

My main goal here is not to criticize what we do have - it is to describe what we should. Grades, tests, interviews, jobs - these are well established as necessary evils, and I'll admit I can't come up with cures for them. However, I do believe that true learning can be a more significant component of education, even with these other ills. The main key? Openness.

I have taken courses in a variety of environments, ranging from quite prestigious to quite not. I have used books and other educational resources ranging from quite proprietary to very open. And I have found in every case that openness and accountability lead to better teaching and true learning, if the students choose it.

That last point is key - students must choose. Education is a treadmill, but learning cannot be forced. My philosophy on this is relatively simple - if they choose not to learn, then that is truly their loss. Attempting to force the uninterested will simply hurt the interested.

A small portion of textbooks and educational material is available openly, "free as in speech" - and besides being at a price any student can afford, this material tends to receive more feedback and thus be higher quality. Super expensive and restrictive material, whether distributed as insultingly priced textbooks or DRM'd PDFs, tends to have less transparency and is actually lower quality. The overuse of phrases like "clearly", "obvious", and "trivial" by lazy writers with other priorities (read: research) leads to dense tomes that are only useful to those who have already spent years with the subject. If something is clear or trivial, then it shouldn't take much time to show it.

Actual lectures are similar - turn on a video camera, and teachers teach better. Several institutions have put high quality lectures on YouTube, and if you watch them you'll find that the teachers are, well, teaching. Of course they were likely selected because of this, but even an inferior teacher will spend some effort to improve if they know their lectures will be shared broadly and in perpetuity.

Ultimately the situation is much like it is with software (hence the relevance here) - openness leads to quality. I have seen closed code and closed courses, and both often have only the absolute minimal effort required in them. If the author/teacher has any other priorities (hobbies, research, anything), why should they bother putting anything beyond the absolute minimum into their work?

Open source code and open education materials have accountability, which forces their creators to make them higher quality or face embarrassment. In the worst case, where they still don't put in much effort, at least somebody else can come along and expand or fork the work. You can't beat the price, and the availability (typically including high quality solution keys) makes such material better for autodidacts as well.

I'll close with two specific suggestions that I think are compatible with the competitive nature of education but will still improve learning. Firstly, all educational material should be licensed in a free/open nature. Secondly, all classrooms should have cameras in the back and have some chance (say 5%) of being recorded and shared.

I would wager that any educational institution willing to enact those simple steps would find the quality of their lectures and educational resources vastly improved. Of course, those steps run counter to the incentives of some powerful publishers and administrators, hence they are unlikely to happen any time soon. But they are compatible with the necessary evils of grades, competition, and industry, and would at least allow those who want to learn and not just advance their career to be better able to do so.

Sunday, November 14, 2010

The correlation-causation fallacy fallacy

This is something I wrote over four years ago, and is not exactly related to programming or computer science. It is instead about statistics and to some degree logic, and it is a piece of writing I still believe to have value. So, unedited and without further ado:

A trend in almost all online discussion of statistical study is to point out the so-called "correlation-causation fallacy" - that is, "correlation does not equal causation."

This is of course true, and is well worth pointing out in some situations. I would estimate that the correlation-causation fallacy is likely second only to ad hominem in terms of fallacies commonly found in public dialogue. Closely related is the concern for which direction the causation may work, but I will save that issue for another time.

For those who are unaware, the correlation-causation fallacy in a nutshell is any sort of argument that goes along the lines of "I observe A happening at the same time as B, therefore A causes B." Stated in plain logical terms it is clear why it is fallacious, but when dressed up in suitable rhetoric - "Those kids are always playing violent videogames and listening to bad music and etc... and they're also doing bad in school, so videogames and bad music and etc. must make you bad at school" - it becomes a very tempting (though still quite wrong) argument indeed. The main danger is that both A and B can be explained by some external factor C, say in the case of the previous example, inattentive parents.

However, this criticism is often leveled against statistical studies. Again, this is not entirely without merit, especially if one is critiquing the specific headline or way a study was framed by the media (which is often inaccurate and overly generalized). However, to use the "correlation-causation fallacy" as a rhetorical cudgel with which to dismiss any and all statistical findings (or at least those you don't like) is a fallacy in and of itself, hence this writeup.

Those who overuse the "correlation is not causation" line often have little understanding of how a proper statistical study is actually conducted. For an example, see this discussion on Slashdot. It's about a recently published study which generally concludes that a few drinks a day is healthy, or at least not too unhealthy. Here's one comment that was highly moderated (i.e. generally approved by the community, which in the case of Slashdot consists of a reasonably intelligent mix of mostly male geeks):
The Old Correlation-Causation Confusion

Well, that would be *excellent*, I love a glass of wine or three a day. A beer or two on a hot day is just heavenly. But unfortunately the correlation may not imply causation. i.e. people who live longer drink more, but not vice-versa.


Maybe really sick people don't drink as much.

Maybe the people that have four drinks a day have to be quite healthy to keep that up day after day after day.

Maybe drinking keeps them off the streets, or out of other dangerous places.

Maybe all the 4-drink-a-day people have died already and were not around for a survey.


Lotsa possible ways to spoil things.

Another highly moderated comment:
Stats 101...

Correlation does not imply causation. All we can say is that "people who drink a bit of alcohol tend to live longer," not that alcohol prolongs their lives. It could be that these individuals take the time to socialize and de-stress, which causes them to live longer. Or perhaps there are financial factors at play: someone who can afford to drink three or four bottles of wine a week is not likely to be living in abject poverty. Of course, it could also be that anti-oxidant properties of the beverages have a positive effect as well.
It is worth noting that there was actually a reasonably insightful reply to the above comment, and I will essentially expand upon what it said here. Both of the above comments, despite their erudition in using the scholarly-sounding terms "correlation" and "causation", are actually a display of general statistical ignorance. Upon examining the news report about the study, it becomes clear that this is not the sort of result that can be so casually dismissed. A key excerpt:
Their conclusion is based on pooled data from 34 large studies involving more than 1 million people and 94,000 deaths.
This was a very large study, and the scope of it suggests to me that those who conducted it are likely well aware of the issues of correlation and causation and that the former does not necessarily imply the latter. In fact, typical statistical methods (including the ones likely used in this study) are built explicitly to help control for these issues. Newsmen and pundits may make the correlation-causation fallacy, but someone who has spent years studying regression analysis is unlikely to. This is not to say that all academic statistical work is flawless - in fact, the more of it I see, the more flaws I see. However, the mistakes are often much like the work itself - very complex. One generally cannot dismiss an academic study with one sentence and a few logical fallacy terms (there are some situations where you can, but I don't think this particular study is one of them).

Don't worry though, I'm not going to just wave my hands here and expect you to believe me. Here is roughly how statistical studies control for the issue of correlation versus causation, among other things: first, it all comes down to your data, and your data depends on your sampling technique. Here they used a pooled sample, combining the results from 34 previous studies into a quite tremendous one million person sample. While we do not know how the individual studies were conducted, it is safe to say that with such a large total sample it should be possible for competent researchers to build a sample generally representative of the total population. That is, given that the world demographics are known (roughly 50/50 gender split, a generally bell-shaped distribution for age as there are few babies and few really old and mostly in the middle, etc.), the researchers can pick and choose data based on these characteristics to have results which can better model back on to reality.
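As a hedged illustration of what "picking and choosing to model back on to reality" can look like, here is a toy post-stratification sketch - the numbers are invented, and I have no idea what weighting scheme the actual study used:

    # Hypothetical: a 1,000-person sample that over-represents men,
    # reweighted by known population shares before estimating an outcome rate.
    population_share = {"male": 0.50, "female": 0.50}   # known demographics
    sample_counts    = {"male": 700,  "female": 300}    # the lopsided sample
    outcome_rate     = {"male": 0.30, "female": 0.50}   # outcome rate within each group

    n = sum(sample_counts.values())

    # Unweighted estimate is pulled toward the over-sampled group
    raw = sum(sample_counts[g] * outcome_rate[g] for g in sample_counts) / n

    # Post-stratified estimate: weight each group by its population share instead
    weighted = sum(population_share[g] * outcome_rate[g] for g in population_share)

    print(f"unweighted estimate: {raw:.1%}")       # 36.0%
    print(f"reweighted estimate: {weighted:.1%}")  # 40.0%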

Of course, the researchers should randomly pick and choose based on these factors, and not based on other ones (for a health related study, it would bias your results a great deal to choose a sample based on preexisting health conditions, e.g. study exclusively healthy or exclusively sick people). And this brings me to another technique of sampling - random sampling. If you cannot collect such a large sample as to allow you to construct a balanced sample, then you can simply choose people at random, thus normalizing all other factors. If done properly, such a study can have pretty good statistical power with a sample as small as a thousand people. True random sampling is increasingly difficult in the modern day, though - the sorts of people who will respond to surveys and studies are different than those who won't, and that alone will bias your sample.
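For a rough sense of why a thousand-person random sample is already decent, here is a quick back-of-the-envelope calculation (my own sketch, using the standard worst-case formula for a proportion):

    import math

    def margin_of_error(n, p=0.5, z=1.96):
        """Worst-case 95% margin of error for an estimated proportion."""
        return z * math.sqrt(p * (1 - p) / n)

    for n in (100, 1000, 10000):
        print(f"n = {n:>6}: +/- {margin_of_error(n):.1%}")
    # n =    100: +/- 9.8%
    # n =   1000: +/- 3.1%
    # n =  10000: +/- 1.0%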

Now, why does having a balanced sample (constructed or random) help with the correlation-causation fallacy? In the words of Sherlock Holmes, "when you have eliminated the impossible, whatever remains, however improbable, must be the truth." That is to say, if your study is adequately controlled for possible external "C" factors (as discussed earlier), then it is reasonable to conclude that the relationship between A and B is causal (though as said earlier, the direction of the cause is another issue).
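To show the "external C factor" logic numerically, here is a small simulation of my own (not the study's method): C drives both A and B, so A and B are strongly correlated, but once C is included in the regression, A's apparent effect vanishes.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    C = rng.normal(size=n)              # external factor, e.g. inattentive parents
    A = 2 * C + rng.normal(size=n)      # hours of videogames, driven by C
    B = 3 * C + rng.normal(size=n)      # poor grades, also driven by C

    # Naive view: A and B look strongly related
    print("corr(A, B):", round(np.corrcoef(A, B)[0, 1], 2))          # ~0.85

    # Regress B on A alone, then on A while controlling for C
    X1 = np.column_stack([np.ones(n), A])
    X2 = np.column_stack([np.ones(n), A, C])
    b1, *_ = np.linalg.lstsq(X1, B, rcond=None)
    b2, *_ = np.linalg.lstsq(X2, B, rcond=None)

    print("coefficient on A, ignoring C:   ", round(b1[1], 2))       # ~1.2
    print("coefficient on A, controlling C:", round(b2[1], 2))       # ~0.0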

In the case of a medical study, that means controlling for characteristics that would be pertinent in terms of health. If you're studying the effects of alcohol, you don't want to survey just healthy people or just sick people, but rather a suitable mix of both. In the field of political science, controls have more to do with, well, political, social, and economic issues. If you want to argue, in international relations, that democracies do not go to war with each other (the "democratic peace", common in both academic papers and presidential speeches), then you would do well to control for GDP (to defend against the argument that democracies simply happen to be wealthy and it is wealthy countries that avoid mutual war, on account of the prohibitive cost of suspending trade and disrupting industry). Of course the argument gets more complex if somebody asserts that democracy causes wealth and then wealth causes peace - while this may somewhat save the "democratic peace", the causal chain must be defended from possible alternative explanations at each link.

Some issues are so slippery, with so many possible causes, that it is very difficult to get statistical traction on them. There are sophisticated methods to help with this, most of which I only have a vague understanding of at the moment (check back in a few years). But yes, some assertions are beyond reasonable testing, particularly when you cannot control the behavior of your objects of study (that is, you are not in an experimental setting such as a laboratory but are rather trying to observe real-world issues such as war). A currently hot field in political science is to try to use a more experimental approach, and this is somewhat possible in domestic politics or public opinion issues where you can take steps to affect your object of study. In the case of international relations, though, it is unlikely that academics will ever be able to tell countries to go to war or not simply in the name of science.

And so, the bottom line is that it is still quite reasonable to be suspicious of statistics, especially when they are being cited by the media and/or politicians. Even when a study is valid, the results are often twisted by an intern who just read the abstract and decided it would make good political fodder in a campaign ad. But just as correlation does not imply causation, suspicion should not entail dismissal - be cautious, but still give studies a decent thinking-through at the least before concluding they are either right or wrong. Ask yourself these questions:
  • Did they build their sample in a reasonably unbiased way?
  • Is there a clear mechanism to explain why their independent variable(s) leads to their dependent variable?
  • Have they accounted for any superior alternative explanations that I can come up with?
If the answer to all three of these is "yes", then it is reasonable to say that the study is accurate. If the answers are more mixed, well, then deal with them as the situation justifies.

Thank you for reading.

My main reason for posting this is posterity - the blog I originally posted this on is no longer online. I also find parts of it (most notably the bits with Slashdot) amusingly anachronistic.

But all in all, I do like the fact that what I wrote still seems sensible to me four years later. It's also still relevant - as was joked in last week's entry, programmers often pretend they are good at math (for example by exclaiming "correlation does not equal causation!"), but in reality that is not their main strength.

Causality is always difficult to tease out in any situation, statistical or not. A rather strong philosophical case can be made that causality is always uncertain at best (though practically speaking one can usually be pretty sure). But the whole point of regression analysis is to control for confounding variables and get at the underlying mechanism - it's not foolproof, but it's not foolhardy either.

Sunday, November 7, 2010

Learn Python The Hard Way - a review

Every programming language has some kind of way of doing numbers and math. Don’t worry, programmers lie frequently about being math geniuses when they really aren’t. If they were math geniuses, they would be doing math not writing ads and social network games to steal people’s money.
This is one of the many gems contained in Learn Python The Hard Way, a free book that claims to teach you how to program "the way people used to teach things." It contains 52 exercises (and essentially nothing else), and argues that learning to program is similar to learning to play an instrument - it requires practice and brute repetition of truly monotonous tasks until you're able to easily detect details and spot differences between code.

One of the specific tenets? Do not copy-paste. I find this one to be interesting, as my own history as a computer tinkerer is full of a great deal of copy-pasting and tweaking. This book argues that you should manually retype exercises to condition your brain and hands, much like playing scales and arpeggios is critical to learning an instrument. I see the appeal, and wonder how I would be different if my background were less copy-paste oriented - I still think that copy-pasting lowers the barrier to entry for certain things, allowing you to explore more freely and cover more material. For the purpose of this book, though, I agree with his "no copy-paste" edict.

So how about the exercises? The book warns that if you know how to program you will find the exercises tedious, and that is true most of the time. I'm not talking "Hello World!" tedious here, but most exercises are very rote and focus on repeatedly manipulating single aspects of the language to understand subtle variations in output. The instructions are to "write, run, fix, and repeat" - not to meander around the language browsing random source code, packages, and other things as most who start programming do.

The exercises finally get a bit freer towards the end (you're supposed to make a simple "choose your own adventure" game), but all in all the book lives up to its title - not really in difficulty, but in strictness and philosophy. It's unfortunate that the humor and appeal of this book will mostly be towards those who already program rather than true "newbies." Both stand to gain from this book, but the latter will be more diligent in working through it as the concepts in the exercises will be new to them. Those with experience programming will skim, skip, and yes, most likely copy-paste, if just to be contrarian.
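For readers who haven't seen the book, the flavor of that final exercise is something like the following toy sketch - my own minimal version, not the book's actual code:

    def ask(prompt, choices):
        """Keep asking until the player types one of the allowed choices."""
        while True:
            answer = input(prompt + " ").strip().lower()
            if answer in choices:
                return answer
            print("Please type one of:", ", ".join(choices))

    def play():
        print("You wake up in a dark cave. There is a tunnel and a ladder.")
        if ask("Take the tunnel or the ladder?", ("tunnel", "ladder")) == "ladder":
            print("The ladder leads up to daylight. You escape!")
            return
        print("The tunnel forks, and one branch smells of sulfur.")
        if ask("Go left or right?", ("left", "right")) == "left":
            print("A sleeping dragon! You tiptoe past it and find the exit. You win.")
        else:
            print("The floor gives way. Game over.")

    if __name__ == "__main__":
        play()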

All in all, this book reminds me in many ways of Why's (Poignant) Guide to Ruby. Both have a sense of humor, and both also have a distinctive and coherent philosophy towards programming and learning. Of course superficially they're quite different, and Why's guide seems to lean more towards the free "play around with things and have fun" approach (I doubt he'd argue for no copy-paste) while the Python book has an almost disciplinary attitude.

Both, though, are fun and educational to read. Abstracting from the books, I still have to recommend Python as more useful than Ruby overall, at least to my knowledge. Ruby is great fun, but is still catching up in terms of speed and scaling while Python can have truly "industry grade" applications. Of course if you're just doing a fun project for yourself it really doesn't matter, but if you're working or seeking employment then Python is probably a better bet (or of course Java/C/C++, but those are all different beasts).