Collider Bias

posted 04-10-2024

While reading Judea Pearl's The Book of Why, I came across a statistical pitfall that I think doesn't get enough attention.

You've probably heard of spurious correlations in statistics, and the common slogan "correlation does not imply causation". This is simple enough to see. Take two variables X and Y that share a common cause Z. For example, when it is raining (Z), people are likely to be carrying an umbrella (X) and the streets are wet (Y). X and Y are correlated, but of course they are not causally connected: neither one causes the other. The spurious correlation can be removed by controlling for the variable Z. That simply means holding the value of Z constant in some way and considering X and Y within a "slice" of Z.
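To make the "slice" idea concrete, here is a small simulation sketch. The probabilities (30% chance of rain, and so on) are made-up numbers chosen only for illustration, and pearson is a hand-rolled helper, not something from a library. The point is that umbrellas and wet streets are correlated overall, but the correlation roughly vanishes once we look only at rainy days or only at dry days.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Z: is it raining? (assumed 30% of days, purely illustrative)
rain = [rng.random() < 0.3 for _ in range(n)]
# X: carrying an umbrella, which depends only on rain
umbrella = [1 if rng.random() < (0.8 if r else 0.1) else 0 for r in rain]
# Y: streets are wet, which also depends only on rain
wet = [1 if rng.random() < (0.9 if r else 0.05) else 0 for r in rain]

# Correlation over all days: X and Y look related.
corr_overall = pearson(umbrella, wet)

# Correlation within each slice of Z: the relationship mostly disappears.
corr_rainy = pearson([u for u, r in zip(umbrella, rain) if r],
                     [w for w, r in zip(wet, rain) if r])
corr_dry = pearson([u for u, r in zip(umbrella, rain) if not r],
                   [w for w, r in zip(wet, rain) if not r])

print(f"all days:   {corr_overall:+.3f}")
print(f"rainy days: {corr_rainy:+.3f}")
print(f"dry days:   {corr_dry:+.3f}")
```

With these assumed numbers the overall correlation comes out clearly positive, while the two within-slice correlations hover near zero.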

So far so good. But there is another, less well-known causal error that is in some ways the dual of the above. Collider bias, also known as Berkson's paradox, happens when two variables X and Y both causally contribute to a third variable Z. When we control for the variable Z, we get a spurious anti-correlation between X and Y! Think of it like arithmetic. If we have an equation Z = X + Y with Z held constant, then when X is almost as large as Z, Y has to be small to make the equation work, and vice versa. Here is an example from the wiki article. Suppose the variable Z is level of fame, X is talent, and Y is attractiveness. Then, examining only famous people, one might observe that attractive famous people tend to be less talented and talented famous people tend to be less attractive.
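The fame example can be sketched as a simulation too. Everything here is an assumption made for illustration: talent and attractiveness are drawn as independent standard normal scores, fame is literally their sum (mirroring Z = X + Y), and "famous" means fame above an arbitrary cutoff of 1.5. The pearson helper is hand-rolled.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20000

# X and Y are generated independently, so they are uncorrelated by construction.
talent = [rng.gauss(0, 1) for _ in range(n)]
attractiveness = [rng.gauss(0, 1) for _ in range(n)]

# Z is the collider: both X and Y feed into it.
fame = [t + a for t, a in zip(talent, attractiveness)]

# "Controlling for Z" here means looking only at people above a fame cutoff.
FAME_CUTOFF = 1.5  # arbitrary illustrative threshold
famous = [(t, a) for t, a, f in zip(talent, attractiveness, fame)
          if f > FAME_CUTOFF]

corr_everyone = pearson(talent, attractiveness)
corr_famous = pearson([t for t, _ in famous], [a for _, a in famous])

print(f"everyone:    {corr_everyone:+.3f}")
print(f"famous only: {corr_famous:+.3f}")
```

In the whole population the correlation is close to zero, but among the famous it turns strongly negative, even though nothing in the setup links talent to attractiveness.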

Collider bias demonstrates the potential pitfalls of controlling for a variable. It pops up in all sorts of areas, and it has been observed in real-life medical studies. For instance, investigators studied mothers' health, smoking, and infant birth weight. In controlling for infant birth weight, they found a surprising (spurious) correlation between smoking and the mother's health!