Causality and deep learning's reliance on IID
Part 1: A primer on IID, exchangeability, and Bayesian causality.
Deep learning is the cutting edge of machine learning. It has been responsible for many recent breakthroughs in machine learning benchmarks.
A recent paper from Yoshua Bengio (a famous deep learning researcher) and other impressive people from Google, Mila, and Max Planck characterizes the limits of deep learning and lays out an agenda for causal machine learning research.
The paper argues that the core problem is that the data you apply an algorithm to often drifts away from the data you trained it on. That is how you get self-driving car algorithms making mistakes a human could easily avoid, despite being trained on millions of miles of road data.
They cast the problem as one of making an erroneous assumption about the data:
Machine learning often disregards information that animals use heavily: interventions in the world, domain shifts, temporal structure — by and large, we consider these factors a nuisance and try to engineer them away. In accordance with this, the majority of current successes of machine learning boil down to large scale pattern recognition on suitably collected independent and identically distributed (IID) data.
Treating data as IID is a common assumption in statistics and machine learning. Statisticians are quite intentional about it because all those “statsy” things they care about (confidence intervals, p-values, bias/variance of estimators, etc.) rely heavily on that assumption.
But the authors are right in that among applied machine learning practitioners, IID is often taken for granted, to the detriment of the robustness of the algorithm in new settings.
So I thought I’d take a hiatus from the crypto posts (going to revisit shortly) and spend a few posts talking about the IID assumption and its close cousin exchangeability in simple terms, then discussing their implications for machine learning and causal inference.
Independent and identically distributed random variables explained
Suppose I flip a coin three times and cover each result with a cup, so that each coin flip is a random variable: we only observe the value each random variable takes when we lift its cup.
This sequence is independent and identically distributed (i.i.d.).
Identical. Each flip happens the same way, so the probability of heads is the same for every flip.
Independent. Lifting a cup and observing the outcome of one flip tells you nothing about the outcomes of the others.
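Here is a rough simulation of that setup, just to make the two properties concrete. The function names, the fair-coin default, and the number of trials are my own choices for illustration. If the flips are independent, knowing the first flip was heads should not move the estimated probability that the second flip is heads.

```python
import random

def flip_iid(n_flips, p_heads=0.5, rng=random):
    """Return n_flips independent flips, each heads ('H') with probability p_heads."""
    return ["H" if rng.random() < p_heads else "T" for _ in range(n_flips)]

def estimate(n_trials=100_000, seed=0):
    """Estimate P(second = H) and P(second = H | first = H) by simulation."""
    rng = random.Random(seed)
    first_heads = 0
    second_heads = 0
    both_heads = 0
    for _ in range(n_trials):
        first, second, _third = flip_iid(3, rng=rng)
        first_heads += first == "H"
        second_heads += second == "H"
        both_heads += first == "H" and second == "H"
    p_second = second_heads / n_trials
    p_second_given_first = both_heads / first_heads
    return p_second, p_second_given_first

p_second, p_second_given_first = estimate()
print(p_second, p_second_given_first)
```

Both numbers should land close to 0.5, up to simulation noise.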
Let's consider other ways we could make a sequence of coin flips that would not be i.i.d.
1. Identical but not independent. Flip the coins such that the outcome of each flip depends on the previous outcome.
2. Independent but not identical. Two of the three flips use coins weighted to land heads with a probability of .6, while the other lands heads with a probability of about .35.
3. Not identical, not independent. Two of the three flips share one probability of heads, and the other has a different one. But unlike in (2), the probabilities are unknown.
In the first case, the coin flips are identical (if we take the first flip as given): each one follows the same conditional distribution P(X | previous X). But they are clearly not independent.
This first case is an example of when the order of the data matters, such as in time series or spatial data.
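One way to picture this first case is a "sticky" coin. This is only a sketch under my own assumptions: the first flip is fair, and each later flip repeats the previous outcome with probability 0.9 (the name `flip_sticky` and that number are made up for illustration). Every flip uses the same rule, yet seeing one flip tells you a lot about the next.

```python
import random

def flip_sticky(n_flips, p_repeat=0.9, rng=random):
    """First flip is fair; each later flip repeats the previous outcome with probability p_repeat."""
    flips = ["H" if rng.random() < 0.5 else "T"]
    for _ in range(n_flips - 1):
        prev = flips[-1]
        if rng.random() < p_repeat:
            flips.append(prev)
        else:
            flips.append("T" if prev == "H" else "H")
    return flips

rng = random.Random(0)
trials = [flip_sticky(3, rng=rng) for _ in range(100_000)]

# Marginal: how often is the second flip heads overall?
p_second_heads = sum(t[1] == "H" for t in trials) / len(trials)

# Conditional: how often is the second flip heads when the first was heads?
first_heads = [t for t in trials if t[0] == "H"]
p_second_given_first = sum(t[1] == "H" for t in first_heads) / len(first_heads)

print(round(p_second_heads, 2), round(p_second_given_first, 2))
```

The marginal estimate should sit near 0.5 while the conditional one sits near 0.9. That gap is the dependence.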
In the second case, the flips are independent but not identical. They are independent because each is still just a coin flip from a known probability. They are not identical because the probabilities are different.
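Because the flips in this second case are independent, the probability of a whole sequence is just the product of the per-flip probabilities. Here is a small sketch using the .6 and .35 from above. I'm assuming the odd coin is the third flip and treating "about .35" as exactly 7/20, neither of which the setup actually specifies.

```python
from fractions import Fraction

# Probability of heads for each flip; which flip gets 0.35 is an assumption.
p_heads = [Fraction(3, 5), Fraction(3, 5), Fraction(7, 20)]

def sequence_probability(sequence, p_heads):
    """Joint probability of independent flips: multiply the per-flip probabilities."""
    prob = Fraction(1)
    for outcome, p in zip(sequence, p_heads):
        prob *= p if outcome == "H" else 1 - p
    return prob

p_hht = sequence_probability("HHT", p_heads)
print(float(p_hht))
```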
In the third case, the flips are neither independent nor identical. The reason they are not identical is the same as the previous case.
The reason they are not independent is trickier. The probabilities are unknown. When you don’t know the probability of heads, but you know that some flips you have already observed shared that probability, you can use those outcomes to help predict the flip in question. When seeing one outcome changes your prediction of another, that is the very definition of dependence.
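To see this with numbers, here is a toy calculation for two flips that share one unknown probability of heads. The setup is entirely hypothetical: I assume the shared probability is either 1/5 or 4/5, and that you consider both equally likely before seeing anything. Observing heads on one flip then shifts your prediction for the other.

```python
from fractions import Fraction

# Hypothetical prior: the shared bias is either 1/5 or 4/5, each equally likely.
prior = {Fraction(1, 5): Fraction(1, 2), Fraction(4, 5): Fraction(1, 2)}

# P(one flip is heads), averaging over the unknown bias.
p_heads = sum(weight * p for p, weight in prior.items())

# P(two flips that share the bias are both heads).
p_both_heads = sum(weight * p * p for p, weight in prior.items())

# P(second is heads | first is heads), by the definition of conditional probability.
p_heads_given_heads = p_both_heads / p_heads

print(p_heads, p_heads_given_heads)
```

Before seeing anything, heads has probability 1/2. After seeing heads on the companion flip, it rises to 17/25, because the heads you saw makes the 4/5 coin the more plausible explanation.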
An idea related to IID is exchangeability. Suppose again that I flip a coin three times. Before you see each outcome, I hide it under a cup.
Suppose I offer you a $10 bet that pays $100 if the sequence is H-H-T.
Would you take the bet? FYI, assuming a fair coin, your expected return is positive ($2.50), unlike any game you'd see in a casino.
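For anyone who wants to check the $2.50, here is the arithmetic as a short sketch. The only assumption is that the coin is fair, and the variable names are mine.

```python
from fractions import Fraction

p_heads = Fraction(1, 2)  # assumption: a fair coin
stake = 10                # dollars paid to place the bet
payout = 100              # dollars received if the sequence is H-H-T

# Independent flips, so multiply: P(H) * P(H) * P(T).
p_hht = p_heads * p_heads * (1 - p_heads)

# Expected winnings minus the cost of the bet.
expected_return = p_hht * payout - stake
print(float(expected_return))
```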
Now, what if, before you saw the outcomes, I randomly shuffled the order of the cups? Note that if the initial outcome was indeed H-H-T, it probably isn't now.
Should you change your bet?
I hope your intuition says no. The word for that intuition is exchangeable; used in a sentence, you say, "this sequence of random variables is exchangeable."
Exchangeability means that reordering the variables in the sequence doesn't change the joint probability distribution over the sequence.
In this case, the sequence is exchangeable because the flips are IID. The two are not synonymous; it is possible to have non-IID sequences that are exchangeable, as we will see.
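The definition is easy to check by brute force for three flips: compute the probability of every sequence and confirm it doesn't change under any reordering. The sketch below does that for three toy processes of my own choosing, none of them from the paper: an IID fair coin, a "sticky" coin where each flip repeats the previous one with probability 9/10, and flips that all share one unknown bias (1/5 or 4/5, equally likely).

```python
from fractions import Fraction
from itertools import permutations, product

def is_exchangeable(joint, n_flips=3):
    """True if every reordering of every outcome sequence has the same probability."""
    for seq in product("HT", repeat=n_flips):
        for reordered in permutations(seq):
            if joint(reordered) != joint(seq):
                return False
    return True

def iid_fair(seq):
    """IID fair coin: every sequence of length n has probability (1/2)^n."""
    return Fraction(1, 2) ** len(seq)

def sticky(seq, p_repeat=Fraction(9, 10)):
    """Fair first flip; each later flip repeats the previous one with probability p_repeat."""
    prob = Fraction(1, 2)
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_repeat if cur == prev else 1 - p_repeat
    return prob

def shared_unknown_bias(seq):
    """All flips share one bias, which is 1/5 or 4/5 with equal probability."""
    heads = seq.count("H")
    total = Fraction(0)
    for p in (Fraction(1, 5), Fraction(4, 5)):
        total += Fraction(1, 2) * p ** heads * (1 - p) ** (len(seq) - heads)
    return total

print(is_exchangeable(iid_fair), is_exchangeable(sticky), is_exchangeable(shared_unknown_bias))
```

Under these assumptions, the IID coin passes, the sticky coin fails (order matters when each flip depends on the one before it), and the shared-bias flips pass even though they are not independent. That last one is an example of an exchangeable sequence that is not IID.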
Making assumptions about data
IID and exchangeability are properties of a random process that generates a sequence of random variables.
In practice, when we get data, we don’t know the process that generates the data. So we can’t say for sure if that data is IID or exchangeable. So we assume it is.
In the following posts, we’ll evaluate what those assumptions do for us and what happens when those assumptions are wrong.
Schölkopf, B., Locatello, F., Bauer, S., Ke, N.R., Kalchbrenner, N., Goyal, A. and Bengio, Y., 2021. Toward Causal Representation Learning. Proceedings of the IEEE.